Time Series Analysis 


James D. Hamilton 


PRINCETON UNIVERSITY PRESS 
PRINCETON, NEW JERSEY 


Copyright © 1994 by Princeton University Press 


Published by Princeton University Press, 41 William St.,
Princeton, New Jersey 08540
In the United Kingdom: Princeton University Press, 
Chichester, West Sussex 


All Rights Reserved 


Library of Congress Cataloging-in-Publication Data 
Hamilton, James D. (James Douglas), (1954-)
Time series analysis / James D. Hamilton.
p. cm.
Includes bibliographical references and indexes.
ISBN 0-691-04289-6
1. Time-series analysis. I. Title.
QA280.H264 1994
519.5'5—dc20 93-4958
CIP


This book has been composed in Times Roman. 


Princeton University Press books are printed on acid-free paper and meet the guidelines for
permanence and durability of the Committee on Production Guidelines for Book Longevity of the
Council on Library Resources.


Printed in the United States of America 


10 9 8 7 6 5 4 3 2 1


Contents


PREFACE


1 Difference Equations

1.1. First-Order Difference Equations
1.2. pth-Order Difference Equations

APPENDIX 1.A. Proofs of Chapter 1 Propositions
References


2 Lag Operators

2.1. Introduction
2.2. First-Order Difference Equations
2.3. Second-Order Difference Equations
2.4. pth-Order Difference Equations
2.5. Initial Conditions and Unbounded Sequences

References


3 Stationary ARMA Processes

3.1. Expectations, Stationarity, and Ergodicity
3.2. White Noise
3.3. Moving Average Processes
3.4. Autoregressive Processes
3.5. Mixed Autoregressive Moving Average Processes
3.6. The Autocovariance-Generating Function
3.7. Invertibility

APPENDIX 3.A. Convergence Results for Infinite-Order Moving Average Processes
Exercises
References


4 Forecasting

4.1. Principles of Forecasting
4.2. Forecasts Based on an Infinite Number of Observations
4.3. Forecasts Based on a Finite Number of Observations
4.4. The Triangular Factorization of a Positive Definite Symmetric Matrix
4.5. Updating a Linear Projection
4.6. Optimal Forecasts for Gaussian Processes
4.7. Sums of ARMA Processes
4.8. Wold’s Decomposition and the Box-Jenkins Modeling Philosophy

APPENDIX 4.A. Parallel Between OLS Regression and Linear Projection
APPENDIX 4.B. Triangular Factorization of the Covariance Matrix for an MA(1) Process
Exercises
References


5 Maximum Likelihood Estimation

5.1. Introduction
5.2. The Likelihood Function for a Gaussian AR(1) Process
5.3. The Likelihood Function for a Gaussian AR(p) Process
5.4. The Likelihood Function for a Gaussian MA(1) Process
5.5. The Likelihood Function for a Gaussian MA(q) Process
5.6. The Likelihood Function for a Gaussian ARMA(p, q) Process
5.7. Numerical Optimization
5.8. Statistical Inference with Maximum Likelihood Estimation
5.9. Inequality Constraints

APPENDIX 5.A. Proofs of Chapter 5 Propositions
Exercises
References


6 Spectral Analysis

6.1. The Population Spectrum
6.2. The Sample Periodogram
6.3. Estimating the Population Spectrum
6.4. Uses of Spectral Analysis

APPENDIX 6.A. Proofs of Chapter 6 Propositions
Exercises
References


7 Asymptotic Distribution Theory

7.1. Review of Asymptotic Distribution Theory
7.2. Limit Theorems for Serially Dependent Observations

APPENDIX 7.A. Proofs of Chapter 7 Propositions
Exercises
References


8 Linear Regression Models

8.1. Review of Ordinary Least Squares with Deterministic Regressors and i.i.d. Gaussian Disturbances
8.2. Ordinary Least Squares Under More General Conditions
8.3. Generalized Least Squares

APPENDIX 8.A. Proofs of Chapter 8 Propositions
Exercises
References


9 Linear Systems of Simultaneous Equations

9.1. Simultaneous Equations Bias
9.2. Instrumental Variables and Two-Stage Least Squares
9.3. Identification
9.4. Full-Information Maximum Likelihood Estimation
9.5. Estimation Based on the Reduced Form
9.6. Overview of Simultaneous Equations Bias

APPENDIX 9.A. Proofs of Chapter 9 Proposition
Exercise
References


10 Covariance-Stationary Vector Processes

10.1. Introduction to Vector Autoregressions
10.2. Autocovariances and Convergence Results for Vector Processes
10.3. The Autocovariance-Generating Function for Vector Processes
10.4. The Spectrum for Vector Processes
10.5. The Sample Mean of a Vector Process

APPENDIX 10.A. Proofs of Chapter 10 Propositions
Exercises
References


11 Vector Autoregressions

11.1. Maximum Likelihood Estimation and Hypothesis Testing for an Unrestricted Vector Autoregression
11.2. Bivariate Granger Causality Tests
11.3. Maximum Likelihood Estimation of Restricted Vector Autoregressions
11.4. The Impulse-Response Function
11.5. Variance Decomposition
11.6. Vector Autoregressions and Structural Econometric Models
11.7. Standard Errors for Impulse-Response Functions

APPENDIX 11.A. Proofs of Chapter 11 Propositions
APPENDIX 11.B. Calculation of Analytic Derivatives
Exercises
References


12 Bayesian Analysis

12.1. Introduction to Bayesian Analysis
12.2. Bayesian Analysis of Vector Autoregressions
12.3. Numerical Bayesian Methods

APPENDIX 12.A. Proofs of Chapter 12 Propositions
Exercise
References


13 The Kalman Filter

13.1. The State-Space Representation of a Dynamic System
13.2. Derivation of the Kalman Filter
13.3. Forecasts Based on the State-Space Representation
13.4. Maximum Likelihood Estimation of Parameters
13.5. The Steady-State Kalman Filter
13.6. Smoothing
13.7. Statistical Inference with the Kalman Filter
13.8. Time-Varying Parameters

APPENDIX 13.A. Proofs of Chapter 13 Propositions
Exercises
References


14 Generalized Method of Moments

14.1. Estimation by the Generalized Method of Moments
14.2. Examples
14.3. Extensions
14.4. GMM and Maximum Likelihood Estimation

APPENDIX 14.A. Proofs of Chapter 14 Propositions
Exercise
References


15 Models of Nonstationary Time Series

15.1. Introduction
15.2. Why Linear Time Trends and Unit Roots?
15.3. Comparison of Trend-Stationary and Unit Root Processes
15.4. The Meaning of Tests for Unit Roots
15.5. Other Approaches to Trended Time Series

APPENDIX 15.A. Derivation of Selected Equations for Chapter 15
References


16 Processes with Deterministic Time Trends

16.1. Asymptotic Distribution of OLS Estimates of the Simple Time Trend Model
16.2. Hypothesis Testing for the Simple Time Trend Model
16.3. Asymptotic Inference for an Autoregressive Process Around a Deterministic Time Trend

APPENDIX 16.A. Derivation of Selected Equations for Chapter 16
Exercises
References


17 Univariate Processes with Unit Roots

17.1. Introduction
17.2. Brownian Motion
17.3. The Functional Central Limit Theorem
17.4. Asymptotic Properties of a First-Order Autoregression when the True Coefficient Is Unity
17.5. Asymptotic Results for Unit Root Processes with General Serial Correlation
17.6. Phillips-Perron Tests for Unit Roots
17.7. Asymptotic Properties of a pth-Order Autoregression and the Augmented Dickey-Fuller Tests for Unit Roots
17.8. Other Approaches to Testing for Unit Roots
17.9. Bayesian Analysis and Unit Roots

APPENDIX 17.A. Proofs of Chapter 17 Propositions
Exercises
References


18 Unit Roots in Multivariate Time Series

18.1. Asymptotic Results for Nonstationary Vector Processes
18.2. Vector Autoregressions Containing Unit Roots
18.3. Spurious Regressions

APPENDIX 18.A. Proofs of Chapter 18 Propositions
Exercises
References


19 Cointegration

19.1. Introduction
19.2. Testing the Null Hypothesis of No Cointegration
19.3. Testing Hypotheses About the Cointegrating Vector

APPENDIX 19.A. Proofs of Chapter 19 Propositions
Exercises
References


20 Full-Information Maximum Likelihood Analysis of Cointegrated Systems

20.1. Canonical Correlation
20.2. Maximum Likelihood Estimation
20.3. Hypothesis Testing
20.4. Overview of Unit Roots—To Difference or Not to Difference?

APPENDIX 20.A. Proofs of Chapter 20 Propositions
Exercises
References


21 Time Series Models of Heteroskedasticity

21.1. Autoregressive Conditional Heteroskedasticity (ARCH)
21.2. Extensions

APPENDIX 21.A. Derivation of Selected Equations for Chapter 21
References


22 Modeling Time Series with Changes in Regime

22.1. Introduction
22.2. Markov Chains
22.3. Statistical Analysis of i.i.d. Mixture Distributions
22.4. Time Series Models of Changes in Regime

APPENDIX 22.A. Derivation of Selected Equations for Chapter 22
Exercise
References


A Mathematical Review

A.1. Trigonometry
A.2. Complex Numbers
A.3. Calculus
A.4. Matrix Algebra
A.5. Probability and Statistics

References


B Statistical Tables


C Answers to Selected Exercises


D Greek Letters and Mathematical Symbols Used in the Text


AUTHOR INDEX


SUBJECT INDEX


Preface 


Much of economics is concerned with modeling dynamics. There has been an
explosion of research in this area in the last decade, as “time series econometrics”
has practically come to be synonymous with “empirical macroeconomics.”

Several texts provide good coverage of the advances in the economic analysis 
of dynamic systems, while others summarize the earlier literature on statistical 
inference for time series data. There seemed a use for a text that could integrate 
the theoretical and empirical issues as well as incorporate the many advances of 
the last decade, such as the analysis of vector autoregressions, estimation by
generalized method of moments, and statistical inference for nonstationary data. This
is the goal of Time Series Analysis. 

A principal anticipated use of the book would be as a textbook for a graduate 
econometrics course in time series analysis. The book aims for maximum flexibility 
through what might be described as an integrated modular structure. As an example 
of this, the first three sections of Chapter 13 on the Kalman filter could be covered 
right after Chapter 4, if desired. Alternatively, Chapter 13 could be skipped
altogether without loss of comprehension. Despite this flexibility, state-space ideas
are fully integrated into the text beginning with Chapter 1, where a state-space 
representation is used (without any jargon or formalism) to introduce the key results 
concerning difference equations. Thus, when the reader encounters the formal 
development of the state-space framework and the Kalman filter in Chapter 13, 
the notation and key ideas should already be quite familiar. 

Spectral analysis (Chapter 6) is another topic that could be covered at a point 
of the reader’s choosing or skipped altogether. In this case, the integrated modular 
structure is achieved by the early introduction and use of autocovariance-generating 
functions and filters. Wherever possible, results are described in terms of these 
rather than the spectrum. 

Although the book is designed with an econometrics course in time series
methods in mind, the book should be useful for several other purposes. It is 
completely self-contained, starting from basic principles accessible to first-year 
graduate students and including an extensive math review appendix. Thus the book 
would be quite suitable for a first-year graduate course in macroeconomics or 
dynamic methods that has no econometric content. Such a course might use
Chapters 1 and 2, Sections 3.1 through 3.5, and Sections 4.1 and 4.2.

Yet another intended use for the book would be in a conventional econometrics
course without an explicit time series focus. The popular econometrics texts
do not have much discussion of such topics as numerical methods; asymptotic results
for serially dependent, heterogeneously distributed observations; estimation of
models with distributed lags; autocorrelation- and heteroskedasticity-consistent
standard errors; Bayesian analysis; or generalized method of moments. All of these
topics receive extensive treatment in Time Series Analysis. Thus, an econometrics 
course without an explicit focus on time series might make use of Sections 3.1 
through 3.5, Chapters 7 through 9, and Chapter 14, and perhaps any of Chapters 
5, 11, and 12 as well. Again, the text is self-contained, with a fairly complete 
discussion of conventional simultaneous equations methods in Chapter 9. Indeed, 
a very important goal of the text is to develop the parallels between (1) the
traditional econometric approach to simultaneous equations and (2) the current
popularity of vector autoregressions and generalized method of moments estimation.

Finally, the book attempts to provide a rigorous motivation for the methods 
and yet still be accessible for researchers with purely applied interests. This is 
achieved by relegation of many details to mathematical appendixes at the ends of 
chapters, and by inclusion of numerous examples that illustrate exactly how the 
theoretical results are used and applied in practice. 

The book developed out of my lectures at the University of Virginia. I am 
grateful first and foremost to my many students over the years whose questions 
and comments have shaped the course of the manuscript. I also have an enormous 
debt to numerous colleagues who have kindly offered many useful suggestions, 
and would like to thank in particular Donald W. K. Andrews, Stephen R. Blough, 
John Cochrane, George Davis, Michael Dotsey, Robert Engle, T. Wake Epps, 
Marjorie Flavin, John Geweke, Eric Ghysels, Carlo Giannini, Clive W. J. Granger, 
Alastair Hall, Bruce E. Hansen, Kevin Hassett, Tomoo Inoue, Ravi Jagannathan, 
Kenneth F. Kroner, Rocco Mosconi, Masao Ogaki, Adrian Pagan, Peter C. B. 
Phillips, Peter Rappoport, Glenn Rudebusch, Raul Susmel, Mark Watson, Kenneth 
D. West, Halbert White, and Jeffrey M. Wooldridge. I would also like to thank 
Pok-sang Lam and John Rogers for graciously sharing their data. Thanks also go 
to Keith Sill and Christopher Stomberg for assistance with the figures, to Rita 
Chen for assistance with the statistical tables in Appendix B, and to Richard Mickey 
for a superb job of copy editing. 


James D. Hamilton 




Time Series Analysis 


1 Difference Equations


1.1. First-Order Difference Equations 


This book is concerned with the dynamic consequences of events over time. Let’s
say we are studying a variable whose value at date $t$ is denoted $y_t$. Suppose we are
given a dynamic equation relating the value $y$ takes on at date $t$ to another variable
$w_t$ and to the value $y$ took on in the previous period:

$$y_t = \phi y_{t-1} + w_t. \qquad [1.1.1]$$

Equation [1.1.1] is a linear first-order difference equation. A difference equation is
an expression relating a variable $y_t$ to its previous values. This is a first-order
difference equation because only the first lag of the variable ($y_{t-1}$) appears in the
equation. Note that it expresses $y_t$ as a linear function of $y_{t-1}$ and $w_t$.

An example of [1.1.1] is Goldfeld’s (1973) estimated money demand function
for the United States. Goldfeld’s model related the log of the real money holdings of
the public ($m_t$) to the log of aggregate real income ($I_t$), the log of the interest rate on
bank accounts ($r_{bt}$), and the log of the interest rate on commercial paper ($r_{ct}$):

$$m_t = 0.27 + 0.72 m_{t-1} + 0.19 I_t - 0.045 r_{bt} - 0.019 r_{ct}. \qquad [1.1.2]$$

This is a special case of [1.1.1] with $y_t = m_t$, $\phi = 0.72$, and

$$w_t = 0.27 + 0.19 I_t - 0.045 r_{bt} - 0.019 r_{ct}.$$

For purposes of analyzing the dynamics of such a system, it simplifies the algebra
a little to summarize the effects of all the input variables ($I_t$, $r_{bt}$, and $r_{ct}$) in terms
of a scalar $w_t$ as here.

In Chapter 3 the input variable $w_t$ will be regarded as a random variable, and
the implications of [1.1.1] for the statistical properties of the output series $y_t$ will be
explored. In preparation for this discussion, it is necessary first to understand the
mechanics of difference equations. For the discussion in Chapters 1 and 2, the values
for the input variable $\{w_0, w_1, \ldots\}$ will simply be regarded as a sequence of
deterministic numbers. Our goal is to answer the following question: If a dynamic system
is described by [1.1.1], what are the effects on $y$ of changes in the value of $w$?


Solving a Difference Equation by Recursive Substitution 


The presumption is that the dynamic equation [1.1.1] governs the behavior
of $y$ for all dates $t$. Thus, for each date we have an equation relating the value of
$y$ for that date to its previous value and the current value of $w$:

Date    Equation
0       $y_0 = \phi y_{-1} + w_0$        [1.1.3]
1       $y_1 = \phi y_0 + w_1$           [1.1.4]
2       $y_2 = \phi y_1 + w_2$           [1.1.5]
...     ...
t       $y_t = \phi y_{t-1} + w_t$.      [1.1.6]

If we know the starting value of $y$ for date $t = -1$ and the value of $w$ for
dates $t = 0, 1, 2, \ldots$, then it is possible to simulate this dynamic system to find
the value of $y$ for any date. For example, if we know the value of $y$ for $t = -1$
and the value of $w$ for $t = 0$, we can calculate the value of $y$ for $t = 0$ directly
from [1.1.3]. Given this value of $y_0$ and the value of $w$ for $t = 1$, we can calculate
the value of $y$ for $t = 1$ from [1.1.4]:

$$y_1 = \phi y_0 + w_1 = \phi(\phi y_{-1} + w_0) + w_1,$$

or

$$y_1 = \phi^2 y_{-1} + \phi w_0 + w_1.$$

Given this value of $y_1$ and the value of $w$ for $t = 2$, we can calculate the value of
$y$ for $t = 2$ from [1.1.5]:

$$y_2 = \phi y_1 + w_2 = \phi(\phi^2 y_{-1} + \phi w_0 + w_1) + w_2,$$

or

$$y_2 = \phi^3 y_{-1} + \phi^2 w_0 + \phi w_1 + w_2.$$


Continuing recursively in this fashion, the value that $y$ takes on at date $t$ can be
described as a function of its initial value $y_{-1}$ and the history of $w$ between date
0 and date $t$:

$$y_t = \phi^{t+1} y_{-1} + \phi^t w_0 + \phi^{t-1} w_1 + \phi^{t-2} w_2 + \cdots + \phi w_{t-1} + w_t. \qquad [1.1.7]$$

This procedure is known as solving the difference equation [1.1.1] by recursive
substitution.
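
The recursion and its closed form can be compared numerically. The following is a minimal Python sketch, not part of the original text: the function names `simulate` and `closed_form`, the coefficient 0.72, the starting value, and the input sequence are all illustrative assumptions.

```python
def simulate(phi, y_init, w):
    """Iterate y_t = phi * y_{t-1} + w_t for t = 0, ..., len(w) - 1."""
    path = []
    y_prev = y_init                  # plays the role of y_{-1}
    for w_t in w:
        y_prev = phi * y_prev + w_t
        path.append(y_prev)
    return path


def closed_form(phi, y_init, w, t):
    """Evaluate the right-hand side of [1.1.7] at date t."""
    total = phi ** (t + 1) * y_init
    for s in range(t + 1):
        total += phi ** (t - s) * w[s]
    return total


phi, y_init = 0.72, 1.0              # illustrative values
w = [0.5, -0.2, 0.1, 0.0, 0.3]       # arbitrary deterministic inputs w_0, ..., w_4
path = simulate(phi, y_init, w)
gaps = [abs(path[t] - closed_form(phi, y_init, w, t)) for t in range(len(w))]
print(max(gaps))                     # expected to be at floating-point rounding level
```

Stepping the equation forward date by date and evaluating [1.1.7] directly should agree up to rounding error for any choice of inputs.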


Dynamic Multipliers 


Note that [1.1.7] expresses $y_t$ as a linear function of the initial value $y_{-1}$ and
the historical values of $w$. This makes it very easy to calculate the effect of $w_0$ on
$y_t$. If $w_0$ were to change with $y_{-1}$ and $w_1, w_2, \ldots, w_t$ taken as unaffected, the
effect on $y_t$ would be given by

$$\frac{\partial y_t}{\partial w_0} = \phi^t. \qquad [1.1.8]$$


Note that the calculations would be exactly the same if the dynamic simulation
were started at date $t$ (taking $y_{t-1}$ as given); then $y_{t+j}$ could be described as a
function of $y_{t-1}$ and $w_t, w_{t+1}, \ldots, w_{t+j}$:

$$y_{t+j} = \phi^{j+1} y_{t-1} + \phi^j w_t + \phi^{j-1} w_{t+1} + \phi^{j-2} w_{t+2} + \cdots + \phi w_{t+j-1} + w_{t+j}. \qquad [1.1.9]$$

The effect of $w_t$ on $y_{t+j}$ is given by

$$\frac{\partial y_{t+j}}{\partial w_t} = \phi^j. \qquad [1.1.10]$$


Thus the dynamic multiplier [1.1.10] depends only on $j$, the length of time separating
the disturbance to the input ($w_t$) and the observed value of the output ($y_{t+j}$). The
multiplier does not depend on $t$; that is, it does not depend on the dates of the
observations themselves. This is true of any linear difference equation.

As an example of calculating a dynamic multiplier, consider again Goldfeld’s
money demand specification [1.1.2]. Suppose we want to know what will happen
to money demand two quarters from now if current income $I_t$ were to increase by
one unit today with future income $I_{t+1}$ and $I_{t+2}$ unaffected:

$$\frac{\partial m_{t+2}}{\partial I_t} = \frac{\partial m_{t+2}}{\partial w_t} \times \frac{\partial w_t}{\partial I_t} = \phi^2 \times \frac{\partial w_t}{\partial I_t}.$$

From [1.1.2], a one-unit increase in $I_t$ will increase $w_t$ by 0.19 units, meaning that
$\partial w_t/\partial I_t = 0.19$. Since $\phi = 0.72$, we calculate

$$\frac{\partial m_{t+2}}{\partial I_t} = (0.72)^2 (0.19) = 0.098.$$

Because $I_t$ is the log of income, an increase in $I_t$ of 0.01 units corresponds to a 1%
increase in income. An increase in $m_t$ of $(0.01) \cdot (0.098) \approx 0.001$ corresponds to
a 0.1% increase in money holdings. Thus the public would be expected to increase
its money holdings by a little less than 0.1% two quarters following a 1% increase
in income.
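
The arithmetic of this example can be reproduced in a few lines. This is a minimal Python sketch added for illustration; the variable names are assumptions, and the only inputs are the two coefficients quoted from [1.1.2].

```python
phi = 0.72            # coefficient on m_{t-1} in [1.1.2]
dw_dI = 0.19          # coefficient on I_t in [1.1.2]
j = 2                 # horizon in quarters

# Chain rule: d m_{t+j} / d I_t = phi**j * (d w_t / d I_t)
multiplier = phi ** j * dw_dI

# A 1% rise in income is a 0.01 rise in log income; convert the log response to percent.
pct_response = 0.01 * multiplier * 100

print(round(multiplier, 3))   # 0.098
print(pct_response < 0.1)     # True: "a little less than 0.1%"
```

Changing `j` gives the response at other horizons under the same specification.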

Different values of $\phi$ in [1.1.1] can produce a variety of dynamic responses
of $y$ to $w$. If $0 < \phi < 1$, the multiplier $\partial y_{t+j}/\partial w_t$ in [1.1.10] decays geometrically
toward zero. Panel (a) of Figure 1.1 plots $\phi^j$ as a function of $j$ for $\phi = 0.8$. If
$-1 < \phi < 0$, the multiplier $\partial y_{t+j}/\partial w_t$ will alternate in sign as in panel (b). In this
case an increase in $w_t$ will cause $y_t$ to be higher, $y_{t+1}$ to be lower, $y_{t+2}$ to be higher,
and so on. Again the absolute value of the effect decays geometrically toward zero.
If $\phi > 1$, the dynamic multiplier increases exponentially over time as in panel (c).
A given increase in $w_t$ has a larger effect the farther into the future one goes. For
$\phi < -1$, the system [1.1.1] exhibits explosive oscillation as in panel (d).

Thus, if $|\phi| < 1$, the system is stable; the consequences of a given change in
$w_t$ will eventually die out. If $|\phi| > 1$, the system is explosive. An interesting
possibility is the borderline case, $\phi = 1$. In this case, the solution [1.1.9] becomes

$$y_{t+j} = y_{t-1} + w_t + w_{t+1} + w_{t+2} + \cdots + w_{t+j-1} + w_{t+j}. \qquad [1.1.11]$$

Here the output variable $y$ is the sum of the historical inputs $w$. A one-unit increase
in $w$ will cause a permanent one-unit increase in $y$:

$$\frac{\partial y_{t+j}}{\partial w_t} = 1 \quad \text{for } j = 0, 1, \ldots.$$
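
The four cases plotted in Figure 1.1 and the borderline case can be tabulated directly from [1.1.10]. The sketch below is illustrative only: the helper names `multipliers` and `regime` are assumptions, and labelling every value with an absolute value of exactly 1 as "borderline" is a convenience of this sketch (the text discusses only the case of a coefficient equal to 1).

```python
def multipliers(phi, horizon):
    """Return [phi**0, phi**1, ..., phi**horizon], the responses in [1.1.10]."""
    return [phi ** j for j in range(horizon + 1)]


def regime(phi):
    """Classify the first-order system by the size of |phi|."""
    if abs(phi) < 1:
        return "stable"
    if abs(phi) > 1:
        return "explosive"
    return "borderline"


for phi in (0.8, -0.8, 1.1, -1.1, 1.0):
    path = multipliers(phi, 4)
    print(phi, regime(phi), [round(x, 3) for x in path])
```

For negative coefficients the printed responses alternate in sign, which is the oscillation described for panels (b) and (d).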


FIGURE 1.1 Dynamic multiplier for first-order difference equation for different
values of $\phi$ (plot of $\partial y_{t+j}/\partial w_t = \phi^j$ as a function of the lag $j$).
Panels: (a) $\phi = 0.8$; (b) $\phi = -0.8$; (c) $\phi = 1.1$; (d) $\phi = -1.1$.

We might also be interested in the effect of $w$ on the present value of the
stream of future realizations of $y$. For a given stream of future values $y_t, y_{t+1},
y_{t+2}, \ldots$ and a constant interest rate¹ $r > 0$, the present value of the stream at
time $t$ is given by

$$y_t + \frac{y_{t+1}}{1 + r} + \frac{y_{t+2}}{(1 + r)^2} + \frac{y_{t+3}}{(1 + r)^3} + \cdots. \qquad [1.1.12]$$


Let $\beta$ denote the discount factor:

$$\beta = 1/(1 + r).$$

Note that $0 < \beta < 1$. Then the present value [1.1.12] can be written as

$$\sum_{j=0}^{\infty} \beta^j y_{t+j}. \qquad [1.1.13]$$
Consider what would happen if there were a one-unit increase in $w_t$ with
$w_{t+1}, w_{t+2}, \ldots$ unaffected. The consequences of this change for the present value
of $y$ are found by differentiating [1.1.13] with respect to $w_t$ and then using [1.1.10]
to evaluate each derivative:

$$\sum_{j=0}^{\infty} \beta^j \frac{\partial y_{t+j}}{\partial w_t} = \sum_{j=0}^{\infty} \beta^j \phi^j = 1/(1 - \beta\phi), \qquad [1.1.14]$$

provided that $|\beta\phi| < 1$.

¹ The interest rate is measured here as a fraction of 1; thus $r = 0.1$ corresponds to a 10% interest
rate.

In calculating the dynamic multipliers [1.1.10] or [1.1.14], we were asking
what would happen if $w_t$ were to increase by one unit with $w_{t+1}, w_{t+2}, \ldots, w_{t+j}$
unaffected. We were thus finding the effect of a purely transitory change in $w$.
Panel (a) of Figure 1.2 shows the time path of $w$ associated with this question, and
panel (b) shows the implied path for $y$. Because the dynamic multiplier [1.1.10]
calculates the response of $y$ to a single impulse in $w$, it is also referred to as the
impulse-response function.
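
The present-value multiplier in [1.1.14] can be checked by truncating the infinite sum at a long horizon. This is a minimal Python sketch under assumed values: the coefficient 0.72, the interest rate 0.1, the horizon of 200 periods, and the function name `pv_multiplier` are illustrative choices, not taken from the text.

```python
def pv_multiplier(phi, r, horizon):
    """Truncated version of the sum in [1.1.14]: sum over j of (beta * phi)**j."""
    beta = 1.0 / (1.0 + r)               # discount factor
    return sum((beta * phi) ** j for j in range(horizon + 1))


phi, r = 0.72, 0.1                       # illustrative values; r = 0.1 is a 10% rate
beta = 1.0 / (1.0 + r)
limit = 1.0 / (1.0 - beta * phi)         # closed form, valid because |beta * phi| < 1
approx = pv_multiplier(phi, r, 200)
print(limit, approx)
```

The truncated sum approaches the closed form quickly here because the product of the discount factor and the coefficient is well below 1 in absolute value.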


FIGURE 1.2 Paths of input variable ($w_t$) and output variable ($y_t$) assumed for
dynamic multiplier and present-value calculations. Panels: (a) value of $w$ against
time; (b) value of $y$ against time.


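Sometimes we might instead be interested in the consequences of a permanent
change in $w$. A permanent change in $w$ means that $w_t, w_{t+1}, \ldots,$ and $w_{t+j}$ would
all increase by one unit, as in Figure 1.3. From formula [1.1.10], the effect on $y_{t+j}$
of a permanent change in $w$ beginning in period $t$ is given by

$$\frac{\partial y_{t+j}}{\partial w_t} + \frac{\partial y_{t+j}}{\partial w_{t+1}} + \frac{\partial y_{t+j}}{\partial w_{t+2}} + \cdots + \frac{\partial y_{t+j}}{\partial w_{t+j}} = \phi^j + \phi^{j-1} + \phi^{j-2} + \cdots + \phi + 1.$$

When $|\phi| < 1$, the limit of this expression as $j$ goes to infinity is sometimes described
as the “long-run” effect of $w$ on $y$:

$$\lim_{j \to \infty} \left[ \frac{\partial y_{t+j}}{\partial w_t} + \frac{\partial y_{t+j}}{\partial w_{t+1}} + \frac{\partial y_{t+j}}{\partial w_{t+2}} + \cdots + \frac{\partial y_{t+j}}{\partial w_{t+j}} \right] = 1 + \phi + \phi^2 + \cdots = 1/(1 - \phi). \qquad [1.1.15]$$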


FIGURE 1.3 Paths of input variable ($w_t$) and output variable ($y_t$) assumed for
long-run effect calculations. Panels: (a) value of $w$ against time; (b) value of $y$
against time.


For example, the long-run income elasticity of money demand in the system [1.1.2]
is given by

$$\frac{0.19}{1 - 0.72} = 0.68.$$

A permanent 1% increase in income will eventually lead to a 0.68% increase in
money demand.
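
The long-run elasticity can also be approached from the finite-horizon sum that precedes [1.1.15]. The following Python sketch is illustrative: the function name `permanent_effect` and the 40-quarter horizon are assumptions; the coefficients 0.72 and 0.19 come from [1.1.2].

```python
def permanent_effect(phi, j):
    """Effect on y_{t+j} of a permanent unit rise in w: phi**j + ... + phi + 1."""
    return sum(phi ** k for k in range(j + 1))


phi, dw_dI = 0.72, 0.19
long_run = dw_dI / (1.0 - phi)                  # long-run income elasticity
after_40 = dw_dI * permanent_effect(phi, 40)    # response 40 quarters after the change
print(round(long_run, 2), round(after_40, 2))   # 0.68 0.68
```

After 40 quarters the accumulated response is already indistinguishable from the limit at two decimal places.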

Another related question concerns the cumulative consequences for $y$ of a
one-time change in $w$. Here we consider a transitory disturbance to $w$ as in panel
(a) of Figure 1.2, but wish to calculate the sum of the consequences for all future
values of $y$. Another way to think of this is as the effect on the present value of $y$
[1.1.13] with the discount rate $\beta = 1$. Setting $\beta = 1$ in [1.1.14] shows this cumulative
effect to be equal to

$$\sum_{j=0}^{\infty} \frac{\partial y_{t+j}}{\partial w_t} = 1/(1 - \phi), \qquad [1.1.16]$$

provided that $|\phi| < 1$. Note that the cumulative effect on $y$ of a transitory change
in $w$ (expression [1.1.16]) is the same as the long-run effect on $y$ of a permanent
change in $w$ (expression [1.1.15]).


1.2. pth-Order Difference Equations 


Let us now generalize the dynamic system [1.1.1] by allowing the value of $y$ at date
$t$ to depend on $p$ of its own lags along with the current value of the input variable
$w_t$:

$$y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + w_t. \qquad [1.2.1]$$

Equation [1.2.1] is a linear pth-order difference equation.

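It is often convenient to rewrite the pth-order difference equation [1.2.1] in
the scalar $y_t$ as a first-order difference equation in a vector $\boldsymbol{\xi}_t$. Define the $(p \times 1)$
vector $\boldsymbol{\xi}_t$ by

$$
\boldsymbol{\xi}_t = \begin{bmatrix} y_t \\ y_{t-1} \\ y_{t-2} \\ \vdots \\ y_{t-p+1} \end{bmatrix}. \qquad [1.2.2]
$$

That is, the first element of the vector $\boldsymbol{\xi}$ at date $t$ is the value $y$ took on at date $t$.
The second element of $\boldsymbol{\xi}_t$ is the value $y$ took on at date $t - 1$, and so on. Define
the $(p \times p)$ matrix $\mathbf{F}$ by

$$
\mathbf{F} = \begin{bmatrix}
\phi_1 & \phi_2 & \phi_3 & \cdots & \phi_{p-1} & \phi_p \\
1 & 0 & 0 & \cdots & 0 & 0 \\
0 & 1 & 0 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & 1 & 0
\end{bmatrix}. \qquad [1.2.3]
$$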


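For example, for $p = 4$, $\mathbf{F}$ refers to the following $4 \times 4$ matrix:

$$
\mathbf{F} = \begin{bmatrix}
\phi_1 & \phi_2 & \phi_3 & \phi_4 \\
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0
\end{bmatrix}.
$$

For $p = 1$ (the first-order difference equation [1.1.1]), $\mathbf{F}$ is just the scalar $\phi$. Finally,
define the $(p \times 1)$ vector $\mathbf{v}_t$ by

$$
\mathbf{v}_t = \begin{bmatrix} w_t \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}. \qquad [1.2.4]
$$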


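Consider the following first-order vector difference equation:

$$\boldsymbol{\xi}_t = \mathbf{F} \boldsymbol{\xi}_{t-1} + \mathbf{v}_t, \qquad [1.2.5]$$

or

$$
\begin{bmatrix} y_t \\ y_{t-1} \\ y_{t-2} \\ \vdots \\ y_{t-p+1} \end{bmatrix}
= \begin{bmatrix}
\phi_1 & \phi_2 & \phi_3 & \cdots & \phi_{p-1} & \phi_p \\
1 & 0 & 0 & \cdots & 0 & 0 \\
0 & 1 & 0 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & 1 & 0
\end{bmatrix}
\begin{bmatrix} y_{t-1} \\ y_{t-2} \\ y_{t-3} \\ \vdots \\ y_{t-p} \end{bmatrix}
+ \begin{bmatrix} w_t \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}.
$$

This is a system of $p$ equations. The first equation in this system is identical to
equation [1.2.1]. The second equation is simply the identity

$$y_{t-1} = y_{t-1},$$

owing to the fact that the second element of $\boldsymbol{\xi}_t$ is the same as the first element of
$\boldsymbol{\xi}_{t-1}$. The third equation in [1.2.5] states that $y_{t-2} = y_{t-2}$; the pth equation states
that $y_{t-p+1} = y_{t-p+1}$.

Thus, the first-order vector system [1.2.5] is simply an alternative representation
of the pth-order scalar system [1.2.1]. The advantage of rewriting the pth-order
system [1.2.1] in the form of a first-order system [1.2.5] is that first-order
systems are often easier to work with than pth-order systems.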
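
The equivalence between the scalar and vector forms can be illustrated with one step of each. This is a minimal pure-Python sketch: the helper names `companion` and `step`, the three coefficients, and the starting values are illustrative assumptions.

```python
def companion(phis):
    """Build the (p x p) matrix F of [1.2.3] from [phi_1, ..., phi_p]."""
    p = len(phis)
    F = [list(phis)]                          # first row holds the coefficients
    for i in range(p - 1):
        F.append([1.0 if k == i else 0.0 for k in range(p)])
    return F


def step(F, xi_prev, w_t):
    """One step of [1.2.5]: xi_t = F xi_{t-1} + v_t with v_t = (w_t, 0, ..., 0)'."""
    xi = [sum(f * x for f, x in zip(row, xi_prev)) for row in F]
    xi[0] += w_t
    return xi


phis = [0.6, 0.2, -0.1]                       # illustrative phi_1, phi_2, phi_3
xi_prev = [1.0, 2.0, 3.0]                     # (y_{t-1}, y_{t-2}, y_{t-3})
w_t = 0.5
xi = step(companion(phis), xi_prev, w_t)
scalar_y = sum(c * y for c, y in zip(phis, xi_prev)) + w_t   # [1.2.1] applied directly
print(xi, scalar_y)
```

The first element of the updated vector matches the scalar equation, and the remaining elements simply shift the earlier values of y down one position.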

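A dynamic multiplier for [1.2.5] can be found in exactly the same way as was
done for the first-order scalar system of Section 1.1. If we knew the value of the
vector $\boldsymbol{\xi}$ for date $t = -1$ and of $\mathbf{v}$ for date $t = 0$, we could find the value of $\boldsymbol{\xi}$ for
date 0 from

$$\boldsymbol{\xi}_0 = \mathbf{F} \boldsymbol{\xi}_{-1} + \mathbf{v}_0.$$

The value of $\boldsymbol{\xi}$ for date 1 is

$$\boldsymbol{\xi}_1 = \mathbf{F} \boldsymbol{\xi}_0 + \mathbf{v}_1 = \mathbf{F}(\mathbf{F} \boldsymbol{\xi}_{-1} + \mathbf{v}_0) + \mathbf{v}_1 = \mathbf{F}^2 \boldsymbol{\xi}_{-1} + \mathbf{F} \mathbf{v}_0 + \mathbf{v}_1.$$

Proceeding recursively in this fashion produces a generalization of [1.1.7]:

$$\boldsymbol{\xi}_t = \mathbf{F}^{t+1} \boldsymbol{\xi}_{-1} + \mathbf{F}^t \mathbf{v}_0 + \mathbf{F}^{t-1} \mathbf{v}_1 + \mathbf{F}^{t-2} \mathbf{v}_2 + \cdots + \mathbf{F} \mathbf{v}_{t-1} + \mathbf{v}_t. \qquad [1.2.6]$$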


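Writing this out in terms of the definitions of $\boldsymbol{\xi}$ and $\mathbf{v}$,

$$
\begin{bmatrix} y_t \\ y_{t-1} \\ y_{t-2} \\ \vdots \\ y_{t-p+1} \end{bmatrix}
= \mathbf{F}^{t+1} \begin{bmatrix} y_{-1} \\ y_{-2} \\ y_{-3} \\ \vdots \\ y_{-p} \end{bmatrix}
+ \mathbf{F}^{t} \begin{bmatrix} w_0 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}
+ \mathbf{F}^{t-1} \begin{bmatrix} w_1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}
+ \cdots
+ \mathbf{F}^{1} \begin{bmatrix} w_{t-1} \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}
+ \begin{bmatrix} w_t \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}. \qquad [1.2.7]
$$

Consider the first equation of this system, which characterizes the value of $y_t$. Let
$f_{11}^{(t)}$ denote the (1, 1) element of $\mathbf{F}^t$, $f_{12}^{(t)}$ the (1, 2) element of $\mathbf{F}^t$, and so on. Then
the first equation of [1.2.7] states that

$$
y_t = f_{11}^{(t+1)} y_{-1} + f_{12}^{(t+1)} y_{-2} + \cdots + f_{1p}^{(t+1)} y_{-p}
+ f_{11}^{(t)} w_0 + f_{11}^{(t-1)} w_1 + \cdots + f_{11}^{(1)} w_{t-1} + w_t. \qquad [1.2.8]
$$

This describes the value of $y$ at date $t$ as a linear function of $p$ initial values of $y$
($y_{-1}, y_{-2}, \ldots, y_{-p}$) and the history of the input variable $w$ since time 0 ($w_0, w_1,
\ldots, w_t$). Note that whereas only one initial value for $y$ (the value $y_{-1}$) was needed
in the case of a first-order difference equation, $p$ initial values for $y$ (the values
$y_{-1}, y_{-2}, \ldots, y_{-p}$) are needed in the case of a pth-order difference equation.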

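The obvious generalization of [1.1.9] is

$$\boldsymbol{\xi}_{t+j} = \mathbf{F}^{j+1} \boldsymbol{\xi}_{t-1} + \mathbf{F}^j \mathbf{v}_t + \mathbf{F}^{j-1} \mathbf{v}_{t+1} + \mathbf{F}^{j-2} \mathbf{v}_{t+2} + \cdots + \mathbf{F} \mathbf{v}_{t+j-1} + \mathbf{v}_{t+j}, \qquad [1.2.9]$$

from which

$$
y_{t+j} = f_{11}^{(j+1)} y_{t-1} + f_{12}^{(j+1)} y_{t-2} + \cdots + f_{1p}^{(j+1)} y_{t-p}
+ f_{11}^{(j)} w_t + f_{11}^{(j-1)} w_{t+1} + f_{11}^{(j-2)} w_{t+2} + \cdots + f_{11}^{(1)} w_{t+j-1} + w_{t+j}. \qquad [1.2.10]
$$

Thus, for a pth-order difference equation, the dynamic multiplier is given by

$$\frac{\partial y_{t+j}}{\partial w_t} = f_{11}^{(j)}, \qquad [1.2.11]$$

where $f_{11}^{(j)}$ denotes the (1, 1) element of $\mathbf{F}^j$. For $j = 1$, this is simply the (1, 1)
element of $\mathbf{F}$, or the parameter $\phi_1$. Thus, for any pth-order system, the effect on
$y_{t+1}$ of a one-unit increase in $w_t$ is given by the coefficient relating $y_t$ to $y_{t-1}$ in
equation [1.2.1]:

$$\frac{\partial y_{t+1}}{\partial w_t} = \phi_1.$$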


Direct multiplication of [1.2.3] reveals that the (1, 1) element of $\mathbf{F}^2$ is $(\phi_1^2 + \phi_2)$,
so

$$\frac{\partial y_{t+2}}{\partial w_t} = \phi_1^2 + \phi_2$$

in a pth-order system.

For larger values of $j$, an easy way to obtain a numerical value for the dynamic
multiplier $\partial y_{t+j}/\partial w_t$ is to simulate the system. This is done as follows. Set $y_{-1} =
y_{-2} = \cdots = y_{-p} = 0$, $w_0 = 1$, and set the value of $w$ for all other dates to 0.
Then use [1.2.1] to calculate the value of $y_t$ for $t = 0$ (namely, $y_0 = 1$). Next
substitute this value along with $y_{t-1}, y_{t-2}, \ldots, y_{t-p+1}$ back into [1.2.1] to calculate
$y_{t+1}$, and continue recursively in this fashion. The value of $y$ at step $t$ gives the
effect of a one-unit change in $w_0$ on $y_t$.
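
The simulation recipe just described can be sketched in a few lines of Python. The function name `impulse_response` and the second-order coefficients 0.6 and 0.2 are illustrative assumptions, not values from the text.

```python
def impulse_response(phis, horizon):
    """Simulate [1.2.1] with zero initial values, w_0 = 1, and w_t = 0 afterwards."""
    p = len(phis)
    lags = [0.0] * p                     # y_{t-1}, ..., y_{t-p}, all zero before date 0
    out = []
    for t in range(horizon + 1):
        w_t = 1.0 if t == 0 else 0.0
        y_t = sum(c * y for c, y in zip(phis, lags)) + w_t
        out.append(y_t)
        lags = [y_t] + lags[:-1]         # shift the lag window forward one date
    return out


phis = [0.6, 0.2]                        # illustrative second-order system
irf = impulse_response(phis, 5)
print(irf[:3])
```

The first three values should equal 1, the first coefficient, and the square of the first coefficient plus the second, in line with the multipliers derived above for horizons 0, 1, and 2.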

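Although numerical simulation may be adequate for many circumstances, it
is also useful to have a simple analytical characterization of $\partial y_{t+j}/\partial w_t$, which, we
know from [1.2.11], is given by the (1, 1) element of $\mathbf{F}^j$. This is fairly easy to obtain
in terms of the eigenvalues of the matrix $\mathbf{F}$. Recall that the eigenvalues of a matrix
$\mathbf{F}$ are those numbers $\lambda$ for which

$$|\mathbf{F} - \lambda \mathbf{I}_p| = 0. \qquad [1.2.12]$$

For example, for $p = 2$ the eigenvalues are the solutions to

$$
\left| \begin{bmatrix} \phi_1 & \phi_2 \\ 1 & 0 \end{bmatrix} - \begin{bmatrix} \lambda & 0 \\ 0 & \lambda \end{bmatrix} \right| = 0
$$

or

$$
\begin{vmatrix} (\phi_1 - \lambda) & \phi_2 \\ 1 & -\lambda \end{vmatrix} = \lambda^2 - \phi_1 \lambda - \phi_2 = 0. \qquad [1.2.13]
$$

The two eigenvalues of $\mathbf{F}$ for a second-order difference equation are thus given by

$$\lambda_1 = \frac{\phi_1 + \sqrt{\phi_1^2 + 4\phi_2}}{2} \qquad [1.2.14]$$

$$\lambda_2 = \frac{\phi_1 - \sqrt{\phi_1^2 + 4\phi_2}}{2}. \qquad [1.2.15]$$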

For a general pth-order system, the determinant in [1.2.12] is a pth-order
polynomial in $\lambda$ whose $p$ solutions characterize the $p$ eigenvalues of $\mathbf{F}$. This polynomial
turns out to take a very similar form to [1.2.13]. The following result is proved in
Appendix 1.A at the end of this chapter.


Proposition 1.1: The eigenvalues of the matrix $\mathbf{F}$ defined in equation [1.2.3] are the
values of $\lambda$ that satisfy

$$\lambda^p - \phi_1 \lambda^{p-1} - \phi_2 \lambda^{p-2} - \cdots - \phi_{p-1} \lambda - \phi_p = 0. \qquad [1.2.16]$$


Once we know the eigenvalues, it is straightforward to characterize the
dynamic behavior of the system. First we consider the case when the eigenvalues of
$\mathbf{F}$ are distinct; for example, we require that $\lambda_1$ and $\lambda_2$ in [1.2.14] and [1.2.15] be
different numbers.


General Solution of a pth-Order Difference Equation 
with Distinct Eigenvalues 


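Recall² that if the eigenvalues of a $(p \times p)$ matrix $\mathbf{F}$ are distinct, there exists
a nonsingular $(p \times p)$ matrix $\mathbf{T}$ such that

$$\mathbf{F} = \mathbf{T} \boldsymbol{\Lambda} \mathbf{T}^{-1} \qquad [1.2.17]$$

where $\boldsymbol{\Lambda}$ is a $(p \times p)$ matrix with the eigenvalues of $\mathbf{F}$ along the principal diagonal
and zeros elsewhere:

$$
\boldsymbol{\Lambda} = \begin{bmatrix}
\lambda_1 & 0 & 0 & \cdots & 0 \\
0 & \lambda_2 & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & \lambda_p
\end{bmatrix}. \qquad [1.2.18]
$$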


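This enables us to characterize the dynamic multiplier (the (1, 1) element of
$F^j$ in [1.2.11]) very easily. For example, from [1.2.17] we can write $F^2$ as

$$F^2 = T \Lambda T^{-1} \times T \Lambda T^{-1}$$
$$= T \times \Lambda \times (T^{-1}T) \times \Lambda \times T^{-1}$$
$$= T \times \Lambda \times I_p \times \Lambda \times T^{-1}$$
$$= T \Lambda^2 T^{-1}.$$

The diagonal structure of $\Lambda$ implies that $\Lambda^2$ is also a diagonal matrix whose elements
are the squares of the eigenvalues of F:

$$\Lambda^2 = \begin{bmatrix} \lambda_1^2 & 0 & 0 & \cdots & 0 \\ 0 & \lambda_2^2 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \cdots & \vdots \\ 0 & 0 & 0 & \cdots & \lambda_p^2 \end{bmatrix}.$$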

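More generally, we can characterize $F^j$ in terms of the eigenvalues of F as

$$F^j = \underbrace{T \Lambda T^{-1} \times T \Lambda T^{-1} \times \cdots \times T \Lambda T^{-1}}_{j \text{ terms}}$$
$$= T \times \Lambda \times (T^{-1}T) \times \Lambda \times (T^{-1}T) \times \cdots \times \Lambda \times T^{-1},$$

which simplifies to

$$F^j = T \Lambda^j T^{-1}$$    [1.2.19]

where

$$\Lambda^j = \begin{bmatrix} \lambda_1^j & 0 & 0 & \cdots & 0 \\ 0 & \lambda_2^j & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \cdots & \vdots \\ 0 & 0 & 0 & \cdots & \lambda_p^j \end{bmatrix}.$$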


² See equation [A.4.24] in the Mathematical Review (Appendix A) at the end of the book.




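Let $t_{ij}$ denote the row i, column j element of T and let $t^{ij}$ denote the row i, column
j element of $T^{-1}$. Equation [1.2.19] written out explicitly becomes

$$F^j = \begin{bmatrix} t_{11} & t_{12} & \cdots & t_{1p} \\ t_{21} & t_{22} & \cdots & t_{2p} \\ \vdots & \vdots & \cdots & \vdots \\ t_{p1} & t_{p2} & \cdots & t_{pp} \end{bmatrix} \begin{bmatrix} \lambda_1^j & 0 & 0 & \cdots & 0 \\ 0 & \lambda_2^j & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \cdots & \vdots \\ 0 & 0 & 0 & \cdots & \lambda_p^j \end{bmatrix} \begin{bmatrix} t^{11} & t^{12} & \cdots & t^{1p} \\ t^{21} & t^{22} & \cdots & t^{2p} \\ \vdots & \vdots & \cdots & \vdots \\ t^{p1} & t^{p2} & \cdots & t^{pp} \end{bmatrix}$$

$$= \begin{bmatrix} t_{11}\lambda_1^j & t_{12}\lambda_2^j & \cdots & t_{1p}\lambda_p^j \\ t_{21}\lambda_1^j & t_{22}\lambda_2^j & \cdots & t_{2p}\lambda_p^j \\ \vdots & \vdots & \cdots & \vdots \\ t_{p1}\lambda_1^j & t_{p2}\lambda_2^j & \cdots & t_{pp}\lambda_p^j \end{bmatrix} \begin{bmatrix} t^{11} & t^{12} & \cdots & t^{1p} \\ t^{21} & t^{22} & \cdots & t^{2p} \\ \vdots & \vdots & \cdots & \vdots \\ t^{p1} & t^{p2} & \cdots & t^{pp} \end{bmatrix}$$

from which the (1, 1) element of $F^j$ is given by

$$f_{11}^{(j)} = [t_{11}t^{11}]\lambda_1^j + [t_{12}t^{21}]\lambda_2^j + \cdots + [t_{1p}t^{p1}]\lambda_p^j$$

or

$$f_{11}^{(j)} = c_1\lambda_1^j + c_2\lambda_2^j + \cdots + c_p\lambda_p^j$$    [1.2.20]

where

$$c_i = [t_{1i}t^{i1}].$$    [1.2.21]

Note that the sum of the $c_i$ terms has the following interpretation:

$$c_1 + c_2 + \cdots + c_p = [t_{11}t^{11}] + [t_{12}t^{21}] + \cdots + [t_{1p}t^{p1}],$$    [1.2.22]

which is the (1, 1) element of $T \cdot T^{-1}$. Since $T \cdot T^{-1}$ is just the (p × p) identity
matrix, [1.2.22] implies that the $c_i$ terms sum to unity:

$$c_1 + c_2 + \cdots + c_p = 1.$$    [1.2.23]

Substituting [1.2.20] into [1.2.11] gives the form of the dynamic multiplier
for a pth-order difference equation:

$$\frac{\partial y_{t+j}}{\partial w_t} = c_1\lambda_1^j + c_2\lambda_2^j + \cdots + c_p\lambda_p^j.$$    [1.2.24]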


Equation [1.2.24] characterizes the dynamic multiplier as a weighted average of 
each of the p eigenvalues raised to the jth power. 
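As a rough numerical check of [1.2.20] through [1.2.24] for p = 2, the sketch below builds T from the eigenvectors $(\lambda_i, 1)'$, which is the form derived in Appendix 1.A, inverts it by hand, and compares $c_1\lambda_1^j + c_2\lambda_2^j$ with the (1, 1) element of $F^j$ obtained by repeated multiplication. The coefficient values and helper names are assumptions of this example.

```python
import cmath

def eigenvalues_2(phi1, phi2):
    """Eigenvalues from [1.2.14] and [1.2.15]; cmath also covers the complex case."""
    root = cmath.sqrt(phi1 ** 2 + 4 * phi2)
    return (phi1 + root) / 2, (phi1 - root) / 2

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

phi1, phi2 = 0.6, 0.2                      # illustrative coefficients
lam1, lam2 = eigenvalues_2(phi1, phi2)

T = [[lam1, lam2], [1, 1]]                 # columns are eigenvectors (lambda_i, 1)'
d = lam1 - lam2                            # determinant of T (nonzero when distinct)
T_inv = [[1 / d, -lam2 / d], [-1 / d, lam1 / d]]
c = [T[0][i] * T_inv[i][0] for i in range(2)]   # c_i = t_{1i} * t^{i1}, as in [1.2.21]

F = [[phi1, phi2], [1.0, 0.0]]
Fj = [[1.0, 0.0], [0.0, 1.0]]              # F^0
for j in range(8):
    closed_form = c[0] * lam1 ** j + c[1] * lam2 ** j
    assert abs(Fj[0][0] - closed_form) < 1e-12
    Fj = matmul(Fj, F)
```

In this example the two weights add up to one, as [1.2.23] requires.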

The following result provides a closed-form expression for the constants ($c_1$,
$c_2$, ..., $c_p$).


Proposition 1.2: If the eigenvalues ($\lambda_1, \lambda_2, \ldots, \lambda_p$) of the matrix F in [1.2.3] are
distinct, then the magnitude $c_i$ in [1.2.21] can be written

$$c_i = \frac{\lambda_i^{p-1}}{\prod_{k=1,\, k \neq i}^{p} (\lambda_i - \lambda_k)}.$$    [1.2.25]
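Expression [1.2.25] translates directly into code. The sketch below is illustrative only: the function name `weights` and the three eigenvalues are hypothetical, and the eigenvalues are assumed to be distinct.

```python
def weights(eigs):
    """c_i = lambda_i^(p-1) / prod_{k != i} (lambda_i - lambda_k), as in [1.2.25]."""
    p = len(eigs)
    out = []
    for i, li in enumerate(eigs):
        denom = 1.0
        for k, lk in enumerate(eigs):
            if k != i:
                denom *= (li - lk)        # nonzero because eigenvalues are distinct
        out.append(li ** (p - 1) / denom)
    return out

c = weights([0.5, 0.3, -0.2])             # illustrative distinct eigenvalues, p = 3
```

For these particular values the weights still sum to one, although individual weights need not lie between 0 and 1 (the second one here is negative).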


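To summarize, the pth-order difference equation [1.2.1] implies that

$$y_{t+j} = f_{11}^{(j+1)} y_{t-1} + f_{12}^{(j+1)} y_{t-2} + \cdots + f_{1p}^{(j+1)} y_{t-p}$$
$$\quad + w_{t+j} + \psi_1 w_{t+j-1} + \psi_2 w_{t+j-2} + \cdots + \psi_{j-1} w_{t+1} + \psi_j w_t.$$    [1.2.26]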




The dynamic multiplier

$$\frac{\partial y_{t+j}}{\partial w_t} = \psi_j$$    [1.2.27]

is given by the (1, 1) element of $F^j$:

$$\psi_j = f_{11}^{(j)}.$$    [1.2.28]

A closed-form expression for $\psi_j$ can be obtained by finding the eigenvalues of F,
or the values of $\lambda$ satisfying [1.2.16]. Denoting these p values by ($\lambda_1, \lambda_2, \ldots, \lambda_p$)
and assuming them to be distinct, the dynamic multiplier is given by

$$\psi_j = c_1\lambda_1^j + c_2\lambda_2^j + \cdots + c_p\lambda_p^j$$    [1.2.29]

where ($c_1, c_2, \ldots, c_p$) is a set of constants summing to unity given by expression
[1.2.25].

For a first-order system (p = 1), this rule would have us solve [1.2.16],

$$\lambda - \phi_1 = 0,$$

which has the single solution

$$\lambda_1 = \phi_1.$$    [1.2.30]

According to [1.2.29], the dynamic multiplier is given by

$$\frac{\partial y_{t+j}}{\partial w_t} = c_1\lambda_1^j.$$    [1.2.31]

From [1.2.23], $c_1 = 1$. Substituting this and [1.2.30] into [1.2.31] gives

$$\frac{\partial y_{t+j}}{\partial w_t} = \phi_1^j,$$

or the same result found in Section 1.1.

For higher-order systems, [1.2.29] allows a variety of more complicated dynamics.
Suppose first that all the eigenvalues of F (or solutions to [1.2.16]) are
real. This would be the case, for example, if p = 2 and $\phi_1^2 + 4\phi_2 > 0$ in the
solutions [1.2.14] and [1.2.15] for the second-order system. If, furthermore, all of
the eigenvalues are less than 1 in absolute value, then the system is stable, and its
dynamics are represented as a weighted average of decaying exponentials or decaying
exponentials oscillating in sign. For example, consider the following second-order
difference equation:

$$y_t = 0.6y_{t-1} + 0.2y_{t-2} + w_t.$$

From equations [1.2.14] and [1.2.15], the eigenvalues of this system are given by

$$\lambda_1 = \frac{0.6 + \sqrt{(0.6)^2 + 4(0.2)}}{2} = 0.84$$

$$\lambda_2 = \frac{0.6 - \sqrt{(0.6)^2 + 4(0.2)}}{2} = -0.24.$$

From [1.2.25], we have

$$c_1 = \lambda_1/(\lambda_1 - \lambda_2) = 0.778$$
$$c_2 = \lambda_2/(\lambda_2 - \lambda_1) = 0.222.$$
The dynamic multiplier for this system,

$$\frac{\partial y_{t+j}}{\partial w_t} = c_1\lambda_1^j + c_2\lambda_2^j,$$

is plotted as a function of j in panel (a) of Figure 1.4.³ Note that as j becomes
larger, the pattern is dominated by the larger eigenvalue ($\lambda_1$), approximating a
simple geometric decay at rate $\lambda_1$.
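The numbers in this example are easy to reproduce. The sketch below recomputes the eigenvalues and weights, cross-checks the closed form against the recursion $\psi_j = 0.6\psi_{j-1} + 0.2\psi_{j-2}$, and looks at the ratio of successive multipliers at a long lag. The variable and function names are ours, and the lag of 20 is an arbitrary choice.

```python
import math

phi1, phi2 = 0.6, 0.2
root = math.sqrt(phi1 ** 2 + 4 * phi2)
lam1, lam2 = (phi1 + root) / 2, (phi1 - root) / 2   # roughly 0.84 and -0.24
c1 = lam1 / (lam1 - lam2)                           # roughly 0.78
c2 = lam2 / (lam2 - lam1)                           # roughly 0.22

def psi(j):
    """Closed-form dynamic multiplier c1*lam1^j + c2*lam2^j."""
    return c1 * lam1 ** j + c2 * lam2 ** j

# Cross-check against the difference equation applied to the multipliers themselves
prev2, prev1 = 1.0, phi1                            # psi_0 and psi_1
for j in range(2, 21):
    cur = phi1 * prev1 + phi2 * prev2
    assert abs(cur - psi(j)) < 1e-12
    prev2, prev1 = prev1, cur

ratio = psi(21) / psi(20)                           # close to lam1 at long lags
```

The ratio of successive multipliers settles near the larger eigenvalue, which is the geometric decay described above.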

If the eigenvalues (the solutions to [1.2.16]) are real but at least one is greater
than unity in absolute value, the system is explosive. If $\lambda_1$ denotes the eigenvalue
that is largest in absolute value, the dynamic multiplier is eventually dominated by
an exponential function of that eigenvalue:

$$\lim_{j \to \infty} \frac{\partial y_{t+j}}{\partial w_t} \cdot \frac{1}{\lambda_1^j} = c_1.$$


Other interesting possibilities arise if some of the eigenvalues are complex.
Whenever this is the case, they appear as complex conjugates. For example, if
p = 2 and $\phi_1^2 + 4\phi_2 < 0$, then the solutions $\lambda_1$ and $\lambda_2$ in [1.2.14] and [1.2.15] are
complex conjugates. Suppose that $\lambda_1$ and $\lambda_2$ are complex conjugates, written as

$$\lambda_1 = a + bi$$    [1.2.32]
$$\lambda_2 = a - bi.$$    [1.2.33]

For the p = 2 case of [1.2.14] and [1.2.15], we would have

$$a = \phi_1/2$$    [1.2.34]
$$b = (1/2)\sqrt{-\phi_1^2 - 4\phi_2}.$$    [1.2.35]

Our goal is to characterize the contribution to the dynamic multiplier of $c_1\lambda_1^j$
when $\lambda_1$ is a complex number as in [1.2.32]. Recall that to raise a complex number
to a power, we rewrite [1.2.32] in polar coordinate form:

$$\lambda_1 = R \cdot [\cos(\theta) + i \cdot \sin(\theta)],$$    [1.2.36]

where $\theta$ and R are defined in terms of a and b by the following equations:

$$R = \sqrt{a^2 + b^2}$$
$$\cos(\theta) = a/R$$
$$\sin(\theta) = b/R.$$

Note that R is equal to the modulus of the complex number $\lambda_1$.
The eigenvalue $\lambda_1$ in [1.2.36] can be written as⁴

$$\lambda_1 = R[e^{i\theta}],$$

and so

$$\lambda_1^j = R^j[e^{i\theta j}] = R^j[\cos(\theta j) + i \cdot \sin(\theta j)].$$    [1.2.37]

Analogously, if $\lambda_2$ is the complex conjugate of $\lambda_1$, then

$$\lambda_2 = R[\cos(\theta) - i \cdot \sin(\theta)],$$

which can be written⁵

$$\lambda_2 = R[e^{-i\theta}].$$

Thus

$$\lambda_2^j = R^j[e^{-i\theta j}] = R^j[\cos(\theta j) - i \cdot \sin(\theta j)].$$    [1.2.38]


³ Again, if one's purpose is solely to generate a numerical plot as in Figure 1.4, the easiest approach
is numerical simulation of the system.

⁴ See equation [A.3.25] in the Mathematical Review (Appendix A) at the end of the book.
⁵ See equation [A.3.26].




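[Figure 1.4: two panels, each plotting the dynamic multiplier against Lag (j) for j from 0 to 20. Panel (a): $\phi_1 = 0.6$, $\phi_2 = 0.2$. Panel (b): $\phi_1 = 0.5$, $\phi_2 = -0.8$.]

FIGURE 1.4 Dynamic multiplier for second-order difference equation for different
values of $\phi_1$ and $\phi_2$ (plot of $\partial y_{t+j}/\partial w_t$ as a function of the lag j).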


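Substituting [1.2.37] and [1.2.38] into [1.2.29] gives the contribution of the complex
conjugates to the dynamic multiplier $\partial y_{t+j}/\partial w_t$:

$$c_1\lambda_1^j + c_2\lambda_2^j = c_1R^j[\cos(\theta j) + i \cdot \sin(\theta j)] + c_2R^j[\cos(\theta j) - i \cdot \sin(\theta j)]$$
$$= [c_1 + c_2] \cdot R^j \cdot \cos(\theta j) + i \cdot [c_1 - c_2] \cdot R^j \cdot \sin(\theta j).$$    [1.2.39]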


The appearance of the imaginary number i in [1.2.39] may seem a little
troubling. After all, this calculation was intended to give the effect of a change in
the real-valued variable $w_t$ on the real-valued variable $y_{t+j}$ as predicted by the
real-valued system [1.2.1], and it would be odd indeed if the correct answer involved
the imaginary number i! Fortunately, it turns out from [1.2.25] that if $\lambda_1$ and $\lambda_2$
are complex conjugates, then $c_1$ and $c_2$ are complex conjugates; that is, they can
be written as

$$c_1 = \alpha + \beta i$$
$$c_2 = \alpha - \beta i$$

for some real numbers $\alpha$ and $\beta$. Substituting these expressions into [1.2.39] yields

$$c_1\lambda_1^j + c_2\lambda_2^j = [(\alpha + \beta i) + (\alpha - \beta i)] \cdot R^j\cos(\theta j) + i \cdot [(\alpha + \beta i) - (\alpha - \beta i)] \cdot R^j\sin(\theta j)$$
$$= [2\alpha] \cdot R^j\cos(\theta j) + i \cdot [2\beta i] \cdot R^j\sin(\theta j)$$
$$= 2\alpha R^j\cos(\theta j) - 2\beta R^j\sin(\theta j),$$

which is strictly real.
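A quick numerical check of this cancellation, using Python's built-in complex arithmetic. The coefficients $\phi_1 = 1.0$ and $\phi_2 = -0.5$ are an illustrative choice of ours that yields complex eigenvalues; they are not taken from the text.

```python
import cmath
import math

phi1, phi2 = 1.0, -0.5                        # illustrative: phi1^2 + 4*phi2 < 0
root = cmath.sqrt(phi1 ** 2 + 4 * phi2)
lam1, lam2 = (phi1 + root) / 2, (phi1 - root) / 2   # a + bi and a - bi
c1 = lam1 / (lam1 - lam2)                     # [1.2.25] with p = 2
c2 = lam2 / (lam2 - lam1)

alpha, beta = c1.real, c1.imag                # c1 = alpha + beta*i
R, theta = abs(lam1), cmath.phase(lam1)       # modulus and angle of lam1

for j in range(12):
    direct = c1 * lam1 ** j + c2 * lam2 ** j
    real_form = (2 * alpha * R ** j * math.cos(theta * j)
                 - 2 * beta * R ** j * math.sin(theta * j))
    assert abs(direct.imag) < 1e-12           # imaginary parts cancel
    assert abs(direct.real - real_form) < 1e-12
```

In this example `c2` is the complex conjugate of `c1`, so the sum carries no imaginary part at any lag.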

Thus, when some of the eigenvalues are complex, they contribute terms
proportional to $R^j\cos(\theta j)$ and $R^j\sin(\theta j)$ to the dynamic multiplier $\partial y_{t+j}/\partial w_t$. Note
that if R = 1 (that is, if the complex eigenvalues have unit modulus), the multipliers
are periodic sine and cosine functions of j. A given increase in $w_t$ increases
$y_{t+j}$ for some ranges of j and decreases $y_{t+j}$ over other ranges, with the impulse
never dying out as $j \to \infty$. If the complex eigenvalues are less than 1 in modulus
(R < 1), the impulse again follows a sinusoidal pattern though its amplitude decays
at the rate $R^j$. If the complex eigenvalues are greater than 1 in modulus (R > 1),
the amplitude of the sinusoids explodes at the rate $R^j$.

For an example of dynamic behavior characterized by decaying sinusoids,
consider the second-order system

$$y_t = 0.5y_{t-1} - 0.8y_{t-2} + w_t.$$

The eigenvalues for this system are given from [1.2.14] and [1.2.15]:

$$\lambda_1 = \frac{0.5 + \sqrt{(0.5)^2 - 4(0.8)}}{2} = 0.25 + 0.86i$$

$$\lambda_2 = \frac{0.5 - \sqrt{(0.5)^2 - 4(0.8)}}{2} = 0.25 - 0.86i,$$

with modulus

$$R = \sqrt{(0.25)^2 + (0.86)^2} = 0.9.$$

Since R < 1, the dynamic multiplier follows a pattern of damped oscillation plotted
in panel (b) of Figure 1.4. The frequency⁶ of these oscillations is given by the
parameter $\theta$ in [1.2.39], which was defined implicitly by

$$\cos(\theta) = a/R = (0.25)/(0.9) = 0.28$$

or

$$\theta = 1.29.$$

The cycles associated with the dynamic multiplier function [1.2.39] thus have a
period of

$$\frac{2\pi}{\theta} = \frac{(2)(3.14159)}{1.29} = 4.9;$$

that is, the peaks in the pattern in panel (b) of Figure 1.4 appear about five periods
apart.
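The modulus, frequency, and period in this example can be recomputed as follows. The sketch carries full floating-point precision rather than the two-digit rounding used above, so its values differ slightly in the later decimals; the variable names are ours.

```python
import cmath
import math

phi1, phi2 = 0.5, -0.8
lam1 = (phi1 + cmath.sqrt(phi1 ** 2 + 4 * phi2)) / 2   # approximately 0.25 + 0.86i

R = abs(lam1)                        # modulus, approximately 0.9
theta = math.acos(lam1.real / R)     # from cos(theta) = a/R
period = 2 * math.pi / theta         # length of one cycle, roughly 4.9 periods

print(round(lam1.real, 2), round(lam1.imag, 2))
print(round(R, 2), round(theta, 2), round(period, 1))
```

Without intermediate rounding, $R^2$ equals 0.8 exactly, which is $-\phi_2$ for this system.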


⁶ See Section A.1 of the Mathematical Review (Appendix A) at the end of the book for a discussion
of the frequency and period of a sinusoidal function.




Solution of a Second-Order Difference Equation 
with Distinct Eigenvalues 


The second-order difference equation (p = 2) comes up sufficiently often
that it is useful to summarize the properties of the solution as a general function
of $\phi_1$ and $\phi_2$, which we now do.⁷

The eigenvalues $\lambda_1$ and $\lambda_2$ in [1.2.14] and [1.2.15] are complex whenever

$$\phi_1^2 + 4\phi_2 < 0,$$

or whenever ($\phi_1$, $\phi_2$) lies below the parabola indicated in Figure 1.5. For the case
of complex eigenvalues, the modulus R satisfies

$$R^2 = a^2 + b^2,$$

or, from [1.2.34] and [1.2.35],

$$R^2 = (\phi_1/2)^2 - (\phi_1^2 + 4\phi_2)/4 = -\phi_2.$$

Thus, a system with complex eigenvalues is explosive whenever $\phi_2 < -1$. Also,
when the eigenvalues are complex, the frequency of oscillations is given by

$$\theta = \cos^{-1}(a/R) = \cos^{-1}[\phi_1/(2\sqrt{-\phi_2})],$$

where "$\cos^{-1}(x)$" denotes the inverse of the cosine function, or the radian measure
of an angle whose cosine is x.


[Figure 1.5: regions of the ($\phi_1$, $\phi_2$) plane labeled "real eigenvalues" and "complex eigenvalues", separated by the parabola $\phi_1^2 + 4\phi_2 = 0$.]

FIGURE 1.5 Summary of dynamics for a second-order difference equation.
⁷ This discussion closely follows Sargent (1987, pp. 188-89).




For the case of real eigenvalues, the arithmetically larger eigenvalue ($\lambda_1$) will
be greater than unity whenever

$$\frac{\phi_1 + \sqrt{\phi_1^2 + 4\phi_2}}{2} > 1$$

or

$$\sqrt{\phi_1^2 + 4\phi_2} > 2 - \phi_1.$$

Assuming that $\lambda_1$ is real, the left side of this expression is a positive number and
the inequality would be satisfied for any value of $\phi_1 > 2$. If, on the other hand,
$\phi_1 < 2$, we can square both sides to conclude that $\lambda_1$ will exceed unity whenever

$$\phi_1^2 + 4\phi_2 > 4 - 4\phi_1 + \phi_1^2$$

or

$$\phi_2 > 1 - \phi_1.$$

Thus, in the real region, $\lambda_1$ will be greater than unity either if $\phi_1 > 2$ or if ($\phi_1$, $\phi_2$)
lies northeast of the line $\phi_2 = 1 - \phi_1$ in Figure 1.5. Similarly, with real eigenvalues,
the arithmetically smaller eigenvalue ($\lambda_2$) will be less than −1 whenever

$$\frac{\phi_1 - \sqrt{\phi_1^2 + 4\phi_2}}{2} < -1$$
$$-\sqrt{\phi_1^2 + 4\phi_2} < -2 - \phi_1$$
$$\sqrt{\phi_1^2 + 4\phi_2} > 2 + \phi_1.$$

Again, if $\phi_1 < -2$, this must be satisfied, and in the case when $\phi_1 > -2$, we can
square both sides:

$$\phi_1^2 + 4\phi_2 > 4 + 4\phi_1 + \phi_1^2$$
$$\phi_2 > 1 + \phi_1.$$

Thus, in the real region, $\lambda_2$ will be less than −1 if either $\phi_1 < -2$ or ($\phi_1$, $\phi_2$) lies
to the northwest of the line $\phi_2 = 1 + \phi_1$ in Figure 1.5.


The system is thus stable whenever ($\phi_1$, $\phi_2$) lies within the triangular region
of Figure 1.5.
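The conditions derived above can be compared with a direct eigenvalue calculation. The sketch assumes that the triangular region is bounded by the lines $\phi_2 = 1 - \phi_1$ and $\phi_2 = 1 + \phi_1$ and by $\phi_2 = -1$ (the three boundaries obtained in the preceding discussion). The function names and test points are hypothetical.

```python
import cmath

def max_modulus(phi1, phi2):
    """Largest modulus among the two eigenvalues in [1.2.14] and [1.2.15]."""
    root = cmath.sqrt(phi1 ** 2 + 4 * phi2)
    return max(abs((phi1 + root) / 2), abs((phi1 - root) / 2))

def in_triangle(phi1, phi2):
    """Inside the region bounded by phi2 = 1 - phi1, phi2 = 1 + phi1, phi2 = -1."""
    return phi2 < 1 - phi1 and phi2 < 1 + phi1 and phi2 > -1

# Illustrative points: two stable cases and four explosive ones
points = [(0.6, 0.2), (0.5, -0.8), (0.9, 0.3),
          (-1.5, -0.4), (0.2, -1.1), (2.5, -1.2)]
for phi1, phi2 in points:
    assert in_triangle(phi1, phi2) == (max_modulus(phi1, phi2) < 1)
```

At each of these points, membership in the triangle coincides with both eigenvalues lying inside the unit circle.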


General Solution of a pth-Order Difference Equation 
with Repeated Eigenvalues 


In the more general case of a difference equation for which F has repeated
eigenvalues and s < p linearly independent eigenvectors, result [1.2.17] is generalized
by using the Jordan decomposition,

$$F = MJM^{-1}$$    [1.2.40]

where M is a (p × p) matrix and J takes the form

$$J = \begin{bmatrix} J_1 & 0 & \cdots & 0 \\ 0 & J_2 & \cdots & 0 \\ \vdots & \vdots & \cdots & \vdots \\ 0 & 0 & \cdots & J_s \end{bmatrix}$$




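with

$$J_i = \begin{bmatrix} \lambda_i & 1 & 0 & \cdots & 0 & 0 \\ 0 & \lambda_i & 1 & \cdots & 0 & 0 \\ 0 & 0 & \lambda_i & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \cdots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & \lambda_i & 1 \\ 0 & 0 & 0 & \cdots & 0 & \lambda_i \end{bmatrix}$$    [1.2.41]

for $\lambda_i$ an eigenvalue of F. If [1.2.17] is replaced by [1.2.40], then equation [1.2.19]
generalizes to

$$F^j = MJ^jM^{-1}$$    [1.2.42]

where

$$J^j = \begin{bmatrix} J_1^j & 0 & \cdots & 0 \\ 0 & J_2^j & \cdots & 0 \\ \vdots & \vdots & \cdots & \vdots \\ 0 & 0 & \cdots & J_s^j \end{bmatrix}.$$

Moreover, from [1.2.41], if $J_i$ is of dimension ($n_i \times n_i$), then⁸

$$J_i^j = \begin{bmatrix} \lambda_i^j & \binom{j}{1}\lambda_i^{j-1} & \binom{j}{2}\lambda_i^{j-2} & \cdots & \binom{j}{n_i-1}\lambda_i^{j-n_i+1} \\ 0 & \lambda_i^j & \binom{j}{1}\lambda_i^{j-1} & \cdots & \binom{j}{n_i-2}\lambda_i^{j-n_i+2} \\ \vdots & \vdots & \vdots & \cdots & \vdots \\ 0 & 0 & 0 & \cdots & \lambda_i^j \end{bmatrix}$$    [1.2.43]

where

$$\binom{j}{n} \equiv \begin{cases} \dfrac{j(j-1)(j-2)\cdots(j-n+1)}{n(n-1)\cdots 3 \cdot 2 \cdot 1} & \text{for } j \geq n \\ 0 & \text{otherwise.} \end{cases}$$

Equation [1.2.43] may be verified by induction by multiplying [1.2.41] by [1.2.43]
and noticing that $\binom{j}{n} + \binom{j}{n-1} = \binom{j+1}{n}$.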
For example, consider again the second-order difference equation, this time
with repeated roots. Then

$$F^j = M \begin{bmatrix} \lambda^j & j\lambda^{j-1} \\ 0 & \lambda^j \end{bmatrix} M^{-1}.$$
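The formula in [1.2.43] for powers of a Jordan block can be checked by brute-force multiplication. In the sketch below the block size n = 3 and the eigenvalue 0.5 are arbitrary illustrative choices, the helper names are ours, and `math.comb` (Python 3.8 or later) supplies the binomial coefficients, returning zero when the lower index exceeds the upper one.

```python
from math import comb

def matmul(A, B):
    n = len(A)
    return [[sum(A[r][k] * B[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]

def jordan_block(lam, n):
    """lam on the diagonal, ones on the superdiagonal, as in [1.2.41]."""
    return [[lam if c == r else (1.0 if c == r + 1 else 0.0) for c in range(n)]
            for r in range(n)]

def jordan_power(lam, n, j):
    """Entry (r, c) equals comb(j, c - r) * lam^(j - (c - r)) on and above the diagonal."""
    return [[comb(j, c - r) * lam ** (j - (c - r)) if c >= r else 0.0
             for c in range(n)] for r in range(n)]

lam, n = 0.5, 3                      # illustrative eigenvalue and block size
J = jordan_block(lam, n)
power = jordan_block(lam, n)         # J^1
for j in range(1, 9):
    formula = jordan_power(lam, n, j)
    assert all(abs(power[r][c] - formula[r][c]) < 1e-12
               for r in range(n) for c in range(n))
    power = matmul(power, J)         # move on to J^(j+1)
```

With n = 2 the same function reproduces the matrix with $\lambda^j$ on the diagonal and $j\lambda^{j-1}$ above it.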


Long-Run and Present-Value Calculations 


If the eigenvalues are all less than 1 in modulus, then $F^j$ in [1.2.9] goes to
zero as j becomes large. If all values of w and y are taken to be bounded, we can
think of a "solution" of $y_t$ in terms of the infinite history of w,

$$y_t = w_t + \psi_1 w_{t-1} + \psi_2 w_{t-2} + \psi_3 w_{t-3} + \cdots,$$    [1.2.44]

where $\psi_j$ is given by the (1, 1) element of $F^j$ and takes the particular form of [1.2.29]
in the case of distinct eigenvalues.

⁸ This expression is taken from Chiang (1980, p. 444).

It is also straightforward to calculate the effect on the present value of y of
a transitory increase in w. This is simplest to find if we first consider the slightly
more general problem of the hypothetical consequences of a change in any element
of the vector $v_t$ on any element of $\xi_{t+j}$ in a general system of the form of [1.2.5].
The answer to this more general problem can be inferred immediately from [1.2.9]:

$$\frac{\partial \xi_{t+j}}{\partial v_t'} = F^j.$$    [1.2.45]

The true dynamic multiplier of interest, $\partial y_{t+j}/\partial w_t$, is just the (1, 1) element of the
(p × p) matrix in [1.2.45]. The effect on the present value of $\xi$ of a change in v
is given by

$$\frac{\partial \sum_{j=0}^{\infty} \beta^j \xi_{t+j}}{\partial v_t'} = \sum_{j=0}^{\infty} \beta^j F^j = (I_p - \beta F)^{-1},$$    [1.2.46]

provided that the eigenvalues of F are all less than $\beta^{-1}$ in modulus. The effect on
the present value of y of a change in w,

$$\frac{\partial \sum_{j=0}^{\infty} \beta^j y_{t+j}}{\partial w_t},$$

is thus the (1, 1) element of the (p × p) matrix in [1.2.46]. This value is given by
the following proposition.


Proposition 1.3: If the eigenvalues of the (p × p) matrix F defined in [1.2.3] are
all less than $\beta^{-1}$ in modulus, then the matrix $(I_p - \beta F)^{-1}$ exists and the effect of
w on the present value of y is given by its (1, 1) element:

$$1/(1 - \phi_1\beta - \phi_2\beta^2 - \cdots - \phi_{p-1}\beta^{p-1} - \phi_p\beta^p).$$


Note that Proposition 1.3 includes the earlier result for a first-order system 
(equation [1.1.14]) as a special case. 
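Proposition 1.3 can be checked against a truncated discounted sum of dynamic multipliers. The sketch below is a hedged illustration for p = 2: the coefficients, the discount factor 0.95, and the truncation at 2,000 terms are all assumptions of the example, and the function name is ours.

```python
def present_value_multiplier(phi, beta):
    """Closed form from Proposition 1.3: 1 / (1 - phi_1*beta - ... - phi_p*beta^p)."""
    return 1.0 / (1.0 - sum(f * beta ** (k + 1) for k, f in enumerate(phi)))

phi, beta = [0.6, 0.2], 0.95             # illustrative values

# Dynamic multipliers psi_j from the recursion psi_j = phi_1*psi_{j-1} + phi_2*psi_{j-2}
psi = [1.0, phi[0]]
for j in range(2, 2000):
    psi.append(phi[0] * psi[-1] + phi[1] * psi[-2])

discounted_sum = sum(beta ** j * s for j, s in enumerate(psi))
closed_form = present_value_multiplier(phi, beta)
```

For a stable system the truncated sum and the closed form agree to many decimal places, since the omitted terms shrink geometrically.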

The cumulative effect of a one-time change in $w_t$ on $y_t, y_{t+1}, \ldots$ can be
considered a special case of Proposition 1.3 with no discounting. Setting $\beta = 1$ in
Proposition 1.3 shows that, provided the eigenvalues of F are all less than 1 in
modulus, the cumulative effect of a one-time change in w on y is given by

$$\sum_{j=0}^{\infty} \frac{\partial y_{t+j}}{\partial w_t} = 1/(1 - \phi_1 - \phi_2 - \cdots - \phi_p).$$    [1.2.47]


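Notice again that [1.2.47] can alternatively be interpreted as giving the eventual
long-run effect on y of a permanent change in w:

$$\lim_{j \to \infty} \left[ \frac{\partial y_{t+j}}{\partial w_t} + \frac{\partial y_{t+j}}{\partial w_{t+1}} + \frac{\partial y_{t+j}}{\partial w_{t+2}} + \cdots + \frac{\partial y_{t+j}}{\partial w_{t+j}} \right] = 1/(1 - \phi_1 - \phi_2 - \cdots - \phi_p).$$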
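The long-run interpretation can be illustrated by simulating a permanent one-unit increase in w and watching y settle down. The coefficients and the 500-period horizon below are illustrative assumptions; the system starts from zero and w equals 1 at every date.

```python
phi = [0.6, 0.2]                     # illustrative stable second-order system

y_lag1, y_lag2 = 0.0, 0.0            # y_{t-1} and y_{t-2}, zero before the change
for _ in range(500):
    y = phi[0] * y_lag1 + phi[1] * y_lag2 + 1.0   # w is permanently higher by one unit
    y_lag2, y_lag1 = y_lag1, y

long_run = 1.0 / (1.0 - sum(phi))    # right side of [1.2.47]
```

After enough periods the simulated level `y_lag1` is indistinguishable from `long_run`, which equals 5 for these coefficients up to floating-point error.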




APPENDIX 1.A. Proofs of Chapter 1 Propositions 
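■ Proof of Proposition 1.1. The eigenvalues of F satisfy

$$|F - \lambda I_p| = 0.$$    [1.A.1]

For the matrix F defined in equation [1.2.3], this determinant would be

$$\left| \begin{bmatrix} \phi_1 & \phi_2 & \phi_3 & \cdots & \phi_{p-1} & \phi_p \\ 1 & 0 & 0 & \cdots & 0 & 0 \\ 0 & 1 & 0 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \cdots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & 1 & 0 \end{bmatrix} - \begin{bmatrix} \lambda & 0 & 0 & \cdots & 0 & 0 \\ 0 & \lambda & 0 & \cdots & 0 & 0 \\ 0 & 0 & \lambda & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \cdots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & 0 & \lambda \end{bmatrix} \right|$$

$$= \begin{vmatrix} (\phi_1 - \lambda) & \phi_2 & \phi_3 & \cdots & \phi_{p-1} & \phi_p \\ 1 & -\lambda & 0 & \cdots & 0 & 0 \\ 0 & 1 & -\lambda & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \cdots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & 1 & -\lambda \end{vmatrix}.$$    [1.A.2]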

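Recall that if we multiply a column of a matrix by a constant and add the result to another
column, the determinant of the matrix is unchanged. If we multiply the pth column of the
matrix in [1.A.2] by $(1/\lambda)$ and add the result to the (p − 1)th column, the result is a matrix
with the same determinant as that in [1.A.2]:

$$|F - \lambda I_p| = \begin{vmatrix} \phi_1 - \lambda & \phi_2 & \phi_3 & \cdots & \phi_{p-2} & \phi_{p-1} + (\phi_p/\lambda) & \phi_p \\ 1 & -\lambda & 0 & \cdots & 0 & 0 & 0 \\ 0 & 1 & -\lambda & \cdots & 0 & 0 & 0 \\ \vdots & \vdots & \vdots & \cdots & \vdots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & 1 & -\lambda & 0 \\ 0 & 0 & 0 & \cdots & 0 & 0 & -\lambda \end{vmatrix}.$$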


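Next, multiply the (p − 1)th column by $(1/\lambda)$ and add the result to the (p − 2)th column:

$$|F - \lambda I_p| = \begin{vmatrix} \phi_1 - \lambda & \phi_2 & \phi_3 & \cdots & \phi_{p-2} + \phi_{p-1}/\lambda + \phi_p/\lambda^2 & \phi_{p-1} + \phi_p/\lambda & \phi_p \\ 1 & -\lambda & 0 & \cdots & 0 & 0 & 0 \\ 0 & 1 & -\lambda & \cdots & 0 & 0 & 0 \\ \vdots & \vdots & \vdots & \cdots & \vdots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & 0 & -\lambda & 0 \\ 0 & 0 & 0 & \cdots & 0 & 0 & -\lambda \end{vmatrix}.$$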


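Continuing in this fashion shows [1.A.1] to be equivalent to the determinant of the following
upper triangular matrix:

$$|F - \lambda I_p| = \begin{vmatrix} \phi_1 - \lambda + \phi_2/\lambda + \phi_3/\lambda^2 + \cdots + \phi_p/\lambda^{p-1} & \phi_2 + \phi_3/\lambda + \phi_4/\lambda^2 + \cdots + \phi_p/\lambda^{p-2} & \cdots & \phi_{p-1} + \phi_p/\lambda & \phi_p \\ 0 & -\lambda & \cdots & 0 & 0 \\ 0 & 0 & \cdots & 0 & 0 \\ \vdots & \vdots & \cdots & \vdots & \vdots \\ 0 & 0 & \cdots & -\lambda & 0 \\ 0 & 0 & \cdots & 0 & -\lambda \end{vmatrix}.$$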


But the determinant of an upper triangular matrix is simply the product of the terms along
the principal diagonal:

$$|F - \lambda I_p| = [\phi_1 - \lambda + \phi_2/\lambda + \phi_3/\lambda^2 + \cdots + \phi_p/\lambda^{p-1}] \cdot [-\lambda]^{p-1}$$
$$= (-1)^p \cdot [\lambda^p - \phi_1\lambda^{p-1} - \phi_2\lambda^{p-2} - \cdots - \phi_p].$$    [1.A.3]

The eigenvalues of F are thus the values of $\lambda$ for which [1.A.3] is zero, or for which

$$\lambda^p - \phi_1\lambda^{p-1} - \phi_2\lambda^{p-2} - \cdots - \phi_p = 0,$$

as asserted in Proposition 1.1. ■
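■ Proof of Proposition 1.2. Assuming that the eigenvalues ($\lambda_1, \lambda_2, \ldots, \lambda_p$) are distinct,
the matrix T in equation [1.2.17] can be constructed from the eigenvectors of F. Let $t_i$
denote the following (p × 1) vector,

$$t_i = \begin{bmatrix} \lambda_i^{p-1} \\ \lambda_i^{p-2} \\ \lambda_i^{p-3} \\ \vdots \\ \lambda_i \\ 1 \end{bmatrix},$$    [1.A.4]

where $\lambda_i$ denotes the ith eigenvalue of F. Notice

$$F t_i = \begin{bmatrix} \phi_1 & \phi_2 & \phi_3 & \cdots & \phi_{p-1} & \phi_p \\ 1 & 0 & 0 & \cdots & 0 & 0 \\ 0 & 1 & 0 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \cdots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & 1 & 0 \end{bmatrix} \begin{bmatrix} \lambda_i^{p-1} \\ \lambda_i^{p-2} \\ \lambda_i^{p-3} \\ \vdots \\ \lambda_i \\ 1 \end{bmatrix} = \begin{bmatrix} \phi_1\lambda_i^{p-1} + \phi_2\lambda_i^{p-2} + \phi_3\lambda_i^{p-3} + \cdots + \phi_{p-1}\lambda_i + \phi_p \\ \lambda_i^{p-1} \\ \lambda_i^{p-2} \\ \vdots \\ \lambda_i^2 \\ \lambda_i \end{bmatrix}.$$    [1.A.5]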
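Since $\lambda_i$ is an eigenvalue of F, it satisfies [1.2.16]:

$$\lambda_i^p - \phi_1\lambda_i^{p-1} - \phi_2\lambda_i^{p-2} - \cdots - \phi_{p-1}\lambda_i - \phi_p = 0.$$    [1.A.6]

Substituting [1.A.6] into [1.A.5] reveals

$$F t_i = \begin{bmatrix} \lambda_i^p \\ \lambda_i^{p-1} \\ \lambda_i^{p-2} \\ \vdots \\ \lambda_i^2 \\ \lambda_i \end{bmatrix} = \lambda_i \begin{bmatrix} \lambda_i^{p-1} \\ \lambda_i^{p-2} \\ \lambda_i^{p-3} \\ \vdots \\ \lambda_i \\ 1 \end{bmatrix}$$

or

$$F t_i = \lambda_i t_i.$$    [1.A.7]

Thus $t_i$ is an eigenvector of F associated with the eigenvalue $\lambda_i$.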
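Result [1.A.7] is easy to confirm numerically for a small system. In the sketch below we pick three distinct eigenvalues (an arbitrary illustrative choice), back out the coefficients that make them roots of [1.2.16] by expanding $(\lambda - \lambda_1)(\lambda - \lambda_2)(\lambda - \lambda_3)$, and verify that each vector $(\lambda_i^2, \lambda_i, 1)'$ satisfies $F t_i = \lambda_i t_i$.

```python
eigs = [0.5, 0.3, -0.2]                    # illustrative distinct eigenvalues
l1, l2, l3 = eigs

# Matching lam^3 - phi1*lam^2 - phi2*lam - phi3 to (lam - l1)(lam - l2)(lam - l3)
phi = [l1 + l2 + l3,                       # phi_1 = sum of the roots
       -(l1 * l2 + l1 * l3 + l2 * l3),     # phi_2 = minus the sum of pairwise products
       l1 * l2 * l3]                       # phi_3 = product of the roots

p = 3
F = [phi, [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

for lam in eigs:
    t = [lam ** (p - 1 - k) for k in range(p)]          # (lam^2, lam, 1), as in [1.A.4]
    Ft = [sum(F[r][c] * t[c] for c in range(p)) for r in range(p)]
    assert all(abs(Ft[r] - lam * t[r]) < 1e-12 for r in range(p))
```

The second and third rows of F merely shift the powers of $\lambda_i$ down one place; only the first row uses the fact that $\lambda_i$ solves [1.2.16].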
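We can calculate the matrix T by combining the eigenvectors ($t_1, t_2, \ldots, t_p$) into a
(p × p) matrix

$$T = [t_1 \quad t_2 \quad \cdots \quad t_p].$$    [1.A.8]

To calculate the particular values for $c_i$ in equation [1.2.21], recall that $T^{-1}$ is
characterized by

$$T T^{-1} = I_p,$$    [1.A.9]

where T is given by [1.A.4] and [1.A.8]. Writing out the first column of the matrix system
of equations [1.A.9] explicitly, we have

$$\begin{bmatrix} \lambda_1^{p-1} & \lambda_2^{p-1} & \cdots & \lambda_p^{p-1} \\ \lambda_1^{p-2} & \lambda_2^{p-2} & \cdots & \lambda_p^{p-2} \\ \vdots & \vdots & \cdots & \vdots \\ \lambda_1 & \lambda_2 & \cdots & \lambda_p \\ 1 & 1 & \cdots & 1 \end{bmatrix} \begin{bmatrix} t^{11} \\ t^{21} \\ \vdots \\ t^{p-1,1} \\ t^{p1} \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \\ 0 \end{bmatrix}.$$

This gives a system of p linear equations in the p unknowns ($t^{11}, t^{21}, \ldots, t^{p1}$). Provided
that the $\lambda_i$ are all distinct, the solution can be shown to be⁹

$$t^{11} = \frac{1}{(\lambda_1 - \lambda_2)(\lambda_1 - \lambda_3) \cdots (\lambda_1 - \lambda_p)}$$

$$t^{21} = \frac{1}{(\lambda_2 - \lambda_1)(\lambda_2 - \lambda_3) \cdots (\lambda_2 - \lambda_p)}$$

$$\vdots$$

$$t^{p1} = \frac{1}{(\lambda_p - \lambda_1)(\lambda_p - \lambda_2) \cdots (\lambda_p - \lambda_{p-1})}.$$

Substituting these values into [1.2.21] gives equation [1.2.25]. ■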


■ Proof of Proposition 1.3. The first claim in this proposition is that if the eigenvalues of
F are less than $\beta^{-1}$ in modulus, then the inverse of $(I_p - \beta F)$ exists. Suppose the inverse
of $(I_p - \beta F)$ did not exist. Then the determinant $|I_p - \beta F|$ would have to be zero. But

$$|I_p - \beta F| = |-\beta \cdot [F - \beta^{-1}I_p]| = (-\beta)^p |F - \beta^{-1}I_p|,$$

so that $|F - \beta^{-1}I_p|$ would have to be zero whenever the inverse of $(I_p - \beta F)$ fails to exist.
But this would mean that $\beta^{-1}$ is an eigenvalue of F, which is ruled out by the assumption
that all eigenvalues of F are strictly less than $\beta^{-1}$ in modulus. Thus, the matrix $I_p - \beta F$
must be nonsingular.
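Since $[I_p - \beta F]^{-1}$ exists, it satisfies the equation

$$[I_p - \beta F]^{-1}[I_p - \beta F] = I_p.$$    [1.A.10]

Let $x_{ij}$ denote the row i, column j element of $[I_p - \beta F]^{-1}$, and write [1.A.10] as

$$\begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \cdots & \vdots \\ x_{p1} & x_{p2} & \cdots & x_{pp} \end{bmatrix} \begin{bmatrix} 1 - \beta\phi_1 & -\beta\phi_2 & \cdots & -\beta\phi_{p-1} & -\beta\phi_p \\ -\beta & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \cdots & \vdots & \vdots \\ 0 & 0 & \cdots & -\beta & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \cdots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}.$$    [1.A.11]

The task is then to find the (1, 1) element of $[I_p - \beta F]^{-1}$, that is, to find the value
of $x_{11}$. To do this we need only consider the first row of equations in [1.A.11]:

$$[x_{11} \quad x_{12} \quad \cdots \quad x_{1p}] \begin{bmatrix} 1 - \beta\phi_1 & -\beta\phi_2 & \cdots & -\beta\phi_{p-1} & -\beta\phi_p \\ -\beta & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \cdots & \vdots & \vdots \\ 0 & 0 & \cdots & -\beta & 1 \end{bmatrix} = [1 \quad 0 \quad \cdots \quad 0 \quad 0].$$    [1.A.12]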
⁹ See Lemma 2 of Chiang (1980, p. 144).




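Consider postmultiplying this system of equations by a matrix with 1s along the principal
diagonal, $\beta$ in the row p, column p − 1 position, and 0s elsewhere:

$$\begin{bmatrix} 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \cdots & \vdots & \vdots \\ 0 & 0 & \cdots & \beta & 1 \end{bmatrix}.$$

The effect of this operation is to multiply the pth column of a matrix by $\beta$ and add the result
to the (p − 1)th column:

$$[x_{11} \quad x_{12} \quad \cdots \quad x_{1p}] \begin{bmatrix} 1 - \beta\phi_1 & -\beta\phi_2 & \cdots & -\beta\phi_{p-1} - \beta^2\phi_p & -\beta\phi_p \\ -\beta & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \cdots & \vdots & \vdots \\ 0 & 0 & \cdots & 0 & 1 \end{bmatrix} = [1 \quad 0 \quad \cdots \quad 0 \quad 0].$$

Next multiply the (p − 1)th column by $\beta$ and add the result to the (p − 2)th column.
Proceeding in this fashion, we arrive at

$$[x_{11} \quad x_{12} \quad \cdots \quad x_{1p}] \times \begin{bmatrix} 1 - \beta\phi_1 - \beta^2\phi_2 - \cdots - \beta^{p-1}\phi_{p-1} - \beta^p\phi_p & -\beta\phi_2 - \beta^2\phi_3 - \cdots - \beta^{p-1}\phi_p & \cdots & -\beta\phi_{p-1} - \beta^2\phi_p & -\beta\phi_p \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \cdots & \vdots & \vdots \\ 0 & 0 & \cdots & 0 & 1 \end{bmatrix}$$

$$= [1 \quad 0 \quad \cdots \quad 0 \quad 0].$$    [1.A.13]

The first equation in [1.A.13] states that

$$x_{11} \cdot (1 - \beta\phi_1 - \beta^2\phi_2 - \cdots - \beta^{p-1}\phi_{p-1} - \beta^p\phi_p) = 1$$

or

$$x_{11} = 1/(1 - \beta\phi_1 - \beta^2\phi_2 - \cdots - \beta^p\phi_p),$$

as claimed in Proposition 1.3. ■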


Chapter 1 References


Chiang, Chin Long. 1980. An Introduction to Stochastic Processes and Their Applications. 
Huntington, N.Y.: Krieger. 


Goldfeld, Stephen M. 1973. “The Demand for Money Revisited,” Brookings Papers on 
Economic Activity 3:577-638. 


Sargent, Thomas J. 1987. Macroeconomic Theory, 2d ed. Boston: Academic Press.




Lag Operators 


2.1. Introduction 


The previous chapter analyzed the dynamics of linear difference equations using 
matrix algebra. This chapter develops some of the same results using time series 
operators. We begin with some introductory remarks on some useful time series 
operators. 

A time series is a collection of observations indexed by the date of each
observation. Usually we have collected data beginning at some particular date (say,
t = 1) and ending at another (say, t = T):

$$(y_1, y_2, \ldots, y_T).$$

We often imagine that we could have obtained earlier observations ($y_0, y_{-1},
y_{-2}, \ldots$) or later observations ($y_{T+1}, y_{T+2}, \ldots$) had the process been observed
for more time. The observed sample ($y_1, y_2, \ldots, y_T$) could then be viewed as a
finite segment of a doubly infinite sequence, denoted $\{y_t\}_{t=-\infty}^{\infty}$:

$$\{y_t\}_{t=-\infty}^{\infty} = \{\ldots, y_{-1}, y_0, \underbrace{y_1, y_2, \ldots, y_T}_{\text{observed sample}}, y_{T+1}, y_{T+2}, \ldots\}.$$

Typically, a time series $\{y_t\}_{t=-\infty}^{\infty}$ is identified by describing the tth element.
For example, a time trend is a series whose value at date t is simply the date of the
observation:

$$y_t = t.$$

We could also consider a time series in which each element is equal to a constant
c, regardless of the date of the observation t:

$$y_t = c.$$

Another important time series is a Gaussian white noise process, denoted

$$y_t = \varepsilon_t,$$

where $\{\varepsilon_t\}_{t=-\infty}^{\infty}$ is a sequence of independent random variables each of which has
a $N(0, \sigma^2)$ distribution.

We are used to thinking of a function such as y = f(x) or y = g(x, w) as an
operation that accepts as input a number (x) or group of numbers (x, w) and
produces the output (y). A time series operator transforms one time series or group
of time series into a new time series. It accepts as input a sequence such as
$\{x_t\}_{t=-\infty}^{\infty}$ or a group of sequences such as ($\{x_t\}_{t=-\infty}^{\infty}$, $\{w_t\}_{t=-\infty}^{\infty}$) and has as output a
new sequence $\{y_t\}_{t=-\infty}^{\infty}$. Again, the operator is summarized by describing the value
of a typical element of $\{y_t\}_{t=-\infty}^{\infty}$ in terms of the corresponding elements of
$\{x_t\}_{t=-\infty}^{\infty}$.
An example of a time series operator is the multiplication operator, represented as

$$y_t = \beta x_t.$$    [2.1.1]

Although it is written exactly the same way as simple scalar multiplication, equation
[2.1.1] is actually shorthand for an infinite sequence of multiplications, one for
each date t. The operator multiplies the value x takes on at any date t by some
constant $\beta$ to generate the value of y for that date.
Another example of a time series operator is the addition operator:

$$y_t = x_t + w_t.$$

Here the value of y at any date t is the sum of the values that x and w take on for
that date.

Since the multiplication or addition operators amount to element-by-element
multiplication or addition, they obey all the standard rules of algebra. For example,
if we multiply each observation of $\{x_t\}_{t=-\infty}^{\infty}$ by $\beta$ and each observation of
$\{w_t\}_{t=-\infty}^{\infty}$ by $\beta$ and add the results,

$$\beta x_t + \beta w_t,$$

the outcome is the same as if we had first added $\{x_t\}_{t=-\infty}^{\infty}$ to $\{w_t\}_{t=-\infty}^{\infty}$ and then
multiplied each element of the resulting series by $\beta$:

$$\beta(x_t + w_t).$$
A highly useful operator is the lag operator. Suppose that we start with a
sequence $\{x_t\}_{t=-\infty}^{\infty}$ and generate a new sequence $\{y_t\}_{t=-\infty}^{\infty}$, where the value of y for
date t is equal to the value x took on at date t − 1:

$$y_t = x_{t-1}.$$    [2.1.2]

This is described as applying the lag operator to $\{x_t\}_{t=-\infty}^{\infty}$. The operation is
represented by the symbol L:

$$Lx_t \equiv x_{t-1}.$$    [2.1.3]

Consider the result of applying the lag operator twice to a series:

$$L(Lx_t) = L(x_{t-1}) = x_{t-2}.$$

Such a double application of the lag operator is indicated by "$L^2$":

$$L^2x_t = x_{t-2}.$$

In general, for any integer k,

$$L^kx_t = x_{t-k}.$$    [2.1.4]
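One way to make [2.1.4] concrete is to represent a doubly infinite series as a function from the date t to a value, so that the lag operator becomes a function that returns a new series. This representation and the name `lag` are choices made for this sketch only.

```python
def lag(x, k=1):
    """Return the series L^k x, whose value at date t is x(t - k)."""
    return lambda t: x(t - k)

trend = lambda t: t                  # the time trend y_t = t

lagged_once = lag(trend)             # L x_t   = x_{t-1}
lagged_twice = lag(lag(trend))       # L(L x_t) = x_{t-2}
direct = lag(trend, 2)               # L^2 x_t = x_{t-2}
```

Evaluated at any date, `lagged_twice` and `direct` give the same value, as the definition of $L^2$ requires.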


Notice that if we first apply the multiplication operator and then the lag
operator, as in

$$x_t \to \beta x_t \to \beta x_{t-1},$$

the result will be exactly the same as if we had applied the lag operator first and
then the multiplication operator:

$$x_t \to x_{t-1} \to \beta x_{t-1}.$$




Thus the lag operator and multiplication operator are commutative:

$$L(\beta x_t) = \beta \cdot Lx_t.$$

Similarly, if we first add two series and then apply the lag operator to the result,

$$(x_t, w_t) \to x_t + w_t \to x_{t-1} + w_{t-1},$$

the result is the same as if we had applied the lag operator before adding:

$$(x_t, w_t) \to (x_{t-1}, w_{t-1}) \to x_{t-1} + w_{t-1}.$$

Thus, the lag operator is distributive over the addition operator:

$$L(x_t + w_t) = Lx_t + Lw_t.$$

We thus see that the lag operator follows exactly the same algebraic rules as
the multiplication operator. For this reason, it is tempting to use the expression
"multiply $y_t$ by L" rather than "operate on $\{y_t\}_{t=-\infty}^{\infty}$ by L." Although the latter
expression is technically more correct, this text will often use the former shorthand
expression to facilitate the exposition.

Faced with a time series defined in terms of compound operators, we are free
to use the standard commutative, associative, and distributive algebraic laws for
multiplication and addition to express the compound operator in an alternative
form. For example, the process defined by

$$y_t = (a + bL)Lx_t$$

is exactly the same as

$$y_t = (aL + bL^2)x_t = ax_{t-1} + bx_{t-2}.$$

To take another example,

$$(1 - \lambda_1L)(1 - \lambda_2L)x_t = (1 - \lambda_1L - \lambda_2L + \lambda_1\lambda_2L^2)x_t$$
$$= (1 - [\lambda_1 + \lambda_2]L + \lambda_1\lambda_2L^2)x_t$$    [2.1.5]
$$= x_t - (\lambda_1 + \lambda_2)x_{t-1} + (\lambda_1\lambda_2)x_{t-2}.$$
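The algebra in [2.1.5] amounts to multiplying coefficient lists. The sketch below stores a polynomial in L as a list of coefficients on $L^0, L^1, L^2, \ldots$ and checks that applying two first-order factors in sequence matches applying their product once. The values 0.4 and 0.2, the test series $x_t = t^2$, and the function names are illustrative assumptions.

```python
def poly_mul(a, b):
    """Multiply two lag polynomials given as coefficient lists [c0, c1, c2, ...]."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj        # the L^i and L^j terms combine into L^(i+j)
    return out

def apply_poly(coeffs, x, t):
    """Value at date t of (c0 + c1*L + c2*L^2 + ...) applied to the series x."""
    return sum(c * x(t - k) for k, c in enumerate(coeffs))

lam1, lam2 = 0.4, 0.2
product = poly_mul([1.0, -lam1], [1.0, -lam2])   # 1 - (lam1+lam2)L + lam1*lam2*L^2

x = lambda t: float(t * t)                        # any series will do
inner = lambda t: apply_poly([1.0, -lam2], x, t)  # (1 - lam2*L) x_t
sequential = apply_poly([1.0, -lam1], inner, 7)   # (1 - lam1*L) applied afterwards
combined = apply_poly(product, x, 7)              # the expanded polynomial applied once
```

The two routes give the same number at every date, which is what licenses treating L like an ordinary algebraic symbol.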


An expression such as ($aL + bL^2$) is referred to as a polynomial in the lag
operator. It is algebraically similar to a simple polynomial ($az + bz^2$) where z is
a scalar. The difference is that the simple polynomial ($az + bz^2$) refers to a
particular number, whereas a polynomial in the lag operator ($aL + bL^2$) refers to
an operator that would be applied to one time series $\{x_t\}_{t=-\infty}^{\infty}$ to produce a new
time series $\{y_t\}_{t=-\infty}^{\infty}$.

Notice that if $\{x_t\}_{t=-\infty}^{\infty}$ is just a series of constants,

$$x_t = c \quad \text{for all } t,$$

then the lag operator applied to $x_t$ produces the same series of constants:

$$Lx_t = x_{t-1} = c.$$

Thus, for example,

$$(\alpha L + \beta L^2 + \gamma L^3)c = (\alpha + \beta + \gamma) \cdot c.$$    [2.1.6]


2.2. First-Order Difference Equations 
Let us now return to the first-order difference equation analyzed in Section 1.1:

$$y_t = \phi y_{t-1} + w_t.$$    [2.2.1]

Equation [2.2.1] can be rewritten using the lag operator [2.1.3] as

$$y_t = \phi Ly_t + w_t.$$

This equation, in turn, can be rearranged using standard algebra,

$$y_t - \phi Ly_t = w_t,$$

or

$$(1 - \phi L)y_t = w_t.$$    [2.2.2]

Next consider "multiplying" both sides of [2.2.2] by the following operator:

$$(1 + \phi L + \phi^2L^2 + \phi^3L^3 + \cdots + \phi^tL^t).$$    [2.2.3]

The result would be

$$(1 + \phi L + \phi^2L^2 + \phi^3L^3 + \cdots + \phi^tL^t)(1 - \phi L)y_t$$
$$= (1 + \phi L + \phi^2L^2 + \phi^3L^3 + \cdots + \phi^tL^t)w_t.$$    [2.2.4]

Expanding out the compound operator on the left side of [2.2.4] results in

$$(1 + \phi L + \phi^2L^2 + \phi^3L^3 + \cdots + \phi^tL^t)(1 - \phi L)$$
$$= (1 + \phi L + \phi^2L^2 + \phi^3L^3 + \cdots + \phi^tL^t)$$
$$\quad - (1 + \phi L + \phi^2L^2 + \phi^3L^3 + \cdots + \phi^tL^t)\phi L$$    [2.2.5]
$$= (1 + \phi L + \phi^2L^2 + \phi^3L^3 + \cdots + \phi^tL^t)$$
$$\quad - (\phi L + \phi^2L^2 + \phi^3L^3 + \cdots + \phi^tL^t + \phi^{t+1}L^{t+1})$$
$$= (1 - \phi^{t+1}L^{t+1}).$$

Substituting [2.2.5] into [2.2.4] yields

$$(1 - \phi^{t+1}L^{t+1})y_t = (1 + \phi L + \phi^2L^2 + \phi^3L^3 + \cdots + \phi^tL^t)w_t.$$    [2.2.6]

Writing [2.2.6] out explicitly using [2.1.4] produces

$$y_t - \phi^{t+1}y_{t-(t+1)} = w_t + \phi w_{t-1} + \phi^2w_{t-2} + \phi^3w_{t-3} + \cdots + \phi^tw_{t-t}$$

or

$$y_t = \phi^{t+1}y_{-1} + w_t + \phi w_{t-1} + \phi^2w_{t-2} + \phi^3w_{t-3} + \cdots + \phi^tw_0.$$    [2.2.7]

Notice that equation [2.2.7] is identical to equation [1.1.7]. Applying the 
operator [2.2.3] is performing exactly the same set of recursive substitutions that 
were employed in the previous chapter to arrive at [1.1.7]. 
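The equivalence between recursive substitution and the operator expression [2.2.7] can be confirmed on a short made-up sample. The value of $\phi$, the starting value $y_{-1}$, and the inputs $w_0, \ldots, w_5$ below are arbitrary illustrative numbers.

```python
phi = 0.9
y_init = 2.0                                # y_{-1}
w = [0.5, -1.0, 0.3, 0.8, -0.2, 1.1]        # w_0, w_1, ..., w_5

# Recursive substitution: y_t = phi * y_{t-1} + w_t
y = y_init
path = []
for wt in w:
    y = phi * y + wt
    path.append(y)

# Closed form [2.2.7] evaluated at the last date t
t = len(w) - 1
closed_form = phi ** (t + 1) * y_init + sum(phi ** k * w[t - k] for k in range(t + 1))
```

Both calculations give the same value of $y_t$, up to floating-point rounding.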

It is interesting to reflect on the nature of the operator [2.2.3] as t becomes
large. We saw in [2.2.5] that

$$(1 + \phi L + \phi^2L^2 + \phi^3L^3 + \cdots + \phi^tL^t)(1 - \phi L)y_t = y_t - \phi^{t+1}y_{-1}.$$

That is, $(1 + \phi L + \phi^2L^2 + \phi^3L^3 + \cdots + \phi^tL^t)(1 - \phi L)y_t$ differs from $y_t$ by
the term $\phi^{t+1}y_{-1}$. If $|\phi| < 1$ and if $y_{-1}$ is a finite number, this residual $\phi^{t+1}y_{-1}$
will become negligible as t becomes large:

$$(1 + \phi L + \phi^2L^2 + \phi^3L^3 + \cdots + \phi^tL^t)(1 - \phi L)y_t \cong y_t \quad \text{for } t \text{ large.}$$


A sequence $\{y_t\}_{t=-\infty}^{\infty}$ is said to be bounded if there exists a finite number $\bar{y}$ such
that

$$|y_t| < \bar{y} \quad \text{for all } t.$$

Thus, when $|\phi| < 1$ and when we are considering applying an operator to a bounded
sequence, we can think of

$$(1 + \phi L + \phi^2L^2 + \phi^3L^3 + \cdots + \phi^jL^j)$$

as approximating the inverse of the operator $(1 - \phi L)$, with this approximation
made arbitrarily accurate by choosing j sufficiently large:

$$(1 - \phi L)^{-1} = \lim_{j \to \infty} (1 + \phi L + \phi^2L^2 + \phi^3L^3 + \cdots + \phi^jL^j).$$    [2.2.8]
ie 


This operator $(1 - \phi L)^{-1}$ has the property

$$(1 - \phi L)^{-1}(1 - \phi L) = 1,$$

where "1" denotes the identity operator:

$$1y_t = y_t.$$

The following chapter discusses stochastic sequences rather than the deter- 
ministic sequences studied here. There we will speak of mean square convergence 
and stationary stochastic processes in place of limits of bounded deterministic 
sequences, though the practical meaning of [2.2.8] will be little changed. 

Provided that $|\phi| < 1$ and we restrict ourselves to bounded sequences or
stationary stochastic processes, both sides of [2.2.2] can be "divided" by $(1 - \phi L)$
to obtain

$$y_t = (1 - \phi L)^{-1}w_t$$

or

$$y_t = w_t + \phi w_{t-1} + \phi^2w_{t-2} + \phi^3w_{t-3} + \cdots.$$    [2.2.9]


It should be emphasized that if we were not restricted to considering bounded
sequences or stationary stochastic processes $\{w_t\}_{t=-\infty}^{\infty}$ and $\{y_t\}_{t=-\infty}^{\infty}$, then expression
[2.2.9] would not be a necessary implication of [2.2.1]. Equation [2.2.9] is consistent
with [2.2.1], but adding a term $a_0\phi^t$,

$$y_t = a_0\phi^t + w_t + \phi w_{t-1} + \phi^2w_{t-2} + \phi^3w_{t-3} + \cdots,$$    [2.2.10]

produces another series consistent with [2.2.1] for any constant $a_0$. To verify that
[2.2.10] is consistent with [2.2.1], multiply [2.2.10] by $(1 - \phi L)$:

$$(1 - \phi L)y_t = (1 - \phi L)a_0\phi^t + (1 - \phi L)(1 - \phi L)^{-1}w_t$$
$$= a_0\phi^t - \phi \cdot a_0\phi^{t-1} + w_t$$
$$= w_t,$$

so that [2.2.10] is consistent with [2.2.1] for any constant $a_0$.
Although any process of the form of [2.2.10] is consistent with the difference
equation [2.2.1], notice that since $|\phi| < 1$,

$$|a_0\phi^t| \to \infty \quad \text{as } t \to -\infty.$$

Thus, even if $\{w_t\}_{t=-\infty}^{\infty}$ is a bounded sequence, the solution $\{y_t\}_{t=-\infty}^{\infty}$ given by [2.2.10]
is unbounded unless $a_0 = 0$ in [2.2.10]. Thus, there was a particular reason for
defining the operator [2.2.8] to be the inverse of $(1 - \phi L)$; namely, $(1 - \phi L)^{-1}$
defined in [2.2.8] is the unique operator satisfying

$$(1 - \phi L)^{-1}(1 - \phi L) = 1$$

that maps a bounded sequence $\{w_t\}_{t=-\infty}^{\infty}$ into a bounded sequence $\{y_t\}_{t=-\infty}^{\infty}$.
The nature of $(1 - \phi L)^{-1}$ when $|\phi| \geq 1$ will be discussed in Section 2.5.


2.3. Second-Order Difference Equations 
Consider next a second-order difference equation:

$$y_t = \phi_1y_{t-1} + \phi_2y_{t-2} + w_t.$$    [2.3.1]

Rewriting this in lag operator form produces

$$(1 - \phi_1L - \phi_2L^2)y_t = w_t.$$    [2.3.2]


The left side of [2.3.2] contains a second-order polynomial in the lag operator
L. Suppose we factor this polynomial, that is, find numbers $\lambda_1$ and $\lambda_2$ such that

$$(1 - \phi_1L - \phi_2L^2) = (1 - \lambda_1L)(1 - \lambda_2L) = (1 - [\lambda_1 + \lambda_2]L + \lambda_1\lambda_2L^2).$$    [2.3.3]

This is just the operation in [2.1.5] in reverse. Given values for $\phi_1$ and $\phi_2$, we seek
numbers $\lambda_1$ and $\lambda_2$ with the properties that

$$\lambda_1 + \lambda_2 = \phi_1$$

and

$$\lambda_1\lambda_2 = -\phi_2.$$

For example, if $\phi_1 = 0.6$ and $\phi_2 = -0.08$, then we should choose $\lambda_1 = 0.4$ and
$\lambda_2 = 0.2$:

$$(1 - 0.6L + 0.08L^2) = (1 - 0.4L)(1 - 0.2L).$$    [2.3.4]


It is easy enough to see that these values of $\lambda_1$ and $\lambda_2$ work for this numerical
example, but how are $\lambda_1$ and $\lambda_2$ found in general? The task is to choose $\lambda_1$ and $\lambda_2$
so as to make sure that the operator on the right side of [2.3.3] is identical to that
on the left side. This will be true whenever the following represent the identical
functions of $z$:

    $(1 - \phi_1 z - \phi_2 z^2) = (1 - \lambda_1 z)(1 - \lambda_2 z)$.    [2.3.5]

This equation simply replaces the lag operator $L$ in [2.3.3] with a scalar $z$. What
is the point of doing so? With [2.3.5], we can now ask, For what values of $z$ is the
right side of [2.3.5] equal to zero? The answer is, if either $z = \lambda_1^{-1}$ or $z = \lambda_2^{-1}$,
then the right side of [2.3.5] would be zero. It would not have made sense to ask
an analogous question of [2.3.3]: $L$ denotes a particular operator, not a number,
and $L = \lambda_1^{-1}$ is not a sensible statement.

Why should we care that the right side of [2.3.5] is zero if $z = \lambda_1^{-1}$ or if $z =
\lambda_2^{-1}$? Recall that the goal was to choose $\lambda_1$ and $\lambda_2$ so that the two sides of [2.3.5]
represented the identical polynomial in $z$. This means that for any particular value
$z$ the two functions must produce the same number. If we find a value of $z$ that
sets the right side to zero, that same value of $z$ must set the left side to zero as
well. But the values of $z$ that set the left side to zero,

    $(1 - \phi_1 z - \phi_2 z^2) = 0$,    [2.3.6]

are given by the quadratic formula:

    $z_1 = \frac{\phi_1 - \sqrt{\phi_1^2 + 4\phi_2}}{-2\phi_2}$    [2.3.7]
    $z_2 = \frac{\phi_1 + \sqrt{\phi_1^2 + 4\phi_2}}{-2\phi_2}$.    [2.3.8]

Setting $z = z_1$ or $z_2$ makes the left side of [2.3.5] zero, while $z = \lambda_1^{-1}$ or
$\lambda_2^{-1}$ sets the right side of [2.3.5] to zero. Thus

    $\lambda_1^{-1} = z_1$    [2.3.9]
    $\lambda_2^{-1} = z_2$.    [2.3.10]




Returning to the numerical example [2.3.4] in which $\phi_1 = 0.6$ and $\phi_2 = -0.08$,
we would calculate

    $z_1 = \frac{0.6 - \sqrt{(0.6)^2 - 4(0.08)}}{2(0.08)} = 2.5$
    $z_2 = \frac{0.6 + \sqrt{(0.6)^2 - 4(0.08)}}{2(0.08)} = 5.0$
and so
    $\lambda_1 = 1/(2.5) = 0.4$
    $\lambda_2 = 1/(5.0) = 0.2$,

as was found in [2.3.4].
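The calculation in [2.3.7] through [2.3.10] is easy to reproduce. The sketch below is an illustration rather than anything from the original text; the function name `factor_second_order` is hypothetical, and the code assumes $\phi_2 \neq 0$. It uses `cmath.sqrt` so that a negative discriminant (complex conjugate roots) is handled as well.

```python
import cmath


def factor_second_order(phi1, phi2):
    """Return ((z1, z2), (lam1, lam2)) for the polynomial 1 - phi1*z - phi2*z**2.

    z1 and z2 follow the quadratic formula [2.3.7]-[2.3.8]; lam_i = 1/z_i as in
    [2.3.9]-[2.3.10]. Assumes phi2 != 0. Results are returned as complex numbers.
    """
    root = cmath.sqrt(phi1**2 + 4 * phi2)
    z1 = (phi1 - root) / (-2 * phi2)
    z2 = (phi1 + root) / (-2 * phi2)
    return (z1, z2), (1 / z1, 1 / z2)


# Numerical example [2.3.4].
(z1, z2), (lam1, lam2) = factor_second_order(0.6, -0.08)
print(z1, z2, lam1, lam2)

# A case with phi1**2 + 4*phi2 < 0 (illustrative values): complex conjugates.
(zc1, zc2), _ = factor_second_order(1.0, -0.5)
print(zc1, zc2)
```

For the example values the reciprocals also satisfy $\lambda_1 + \lambda_2 = \phi_1$ and $\lambda_1 \lambda_2 = -\phi_2$, which is a quick sanity check on any factorization.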

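When $\phi_1^2 + 4\phi_2 < 0$, the values $z_1$ and $z_2$ are complex conjugates, and their
reciprocals $\lambda_1$ and $\lambda_2$ can be found by first writing the complex number in polar
coordinate form. Specifically, write

    $z_1 = a + bi$
as
    $z_1 = R \cdot [\cos(\theta) + i \cdot \sin(\theta)] = R \cdot e^{i\theta}$.
Then
    $z_1^{-1} = R^{-1} \cdot e^{-i\theta} = R^{-1} \cdot [\cos(\theta) - i \cdot \sin(\theta)]$.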


Actually, there is a more direct method for calculating the values of $\lambda_1$ and
$\lambda_2$ from $\phi_1$ and $\phi_2$. Divide both sides of [2.3.5] by $z^2$:

    $(z^{-2} - \phi_1 z^{-1} - \phi_2) = (z^{-1} - \lambda_1)(z^{-1} - \lambda_2)$    [2.3.11]

and define $\lambda$ to be the variable $z^{-1}$:

    $\lambda \equiv z^{-1}$.    [2.3.12]

Substituting [2.3.12] into [2.3.11] produces

    $(\lambda^2 - \phi_1 \lambda - \phi_2) = (\lambda - \lambda_1)(\lambda - \lambda_2)$.    [2.3.13]

Again, [2.3.13] must hold for all values of $\lambda$ in order for the two sides of [2.3.5]
to represent the same polynomial. The values of $\lambda$ that set the right side to zero
are $\lambda = \lambda_1$ and $\lambda = \lambda_2$. These same values must set the left side of [2.3.13] to zero
as well:

    $(\lambda^2 - \phi_1 \lambda - \phi_2) = 0$.    [2.3.14]

Thus, to calculate the values of $\lambda_1$ and $\lambda_2$ that factor the polynomial in [2.3.3], we
can find the roots of [2.3.14] directly from the quadratic formula:

    $\lambda_1 = \frac{\phi_1 + \sqrt{\phi_1^2 + 4\phi_2}}{2}$    [2.3.15]
    $\lambda_2 = \frac{\phi_1 - \sqrt{\phi_1^2 + 4\phi_2}}{2}$.    [2.3.16]

For the example of [2.3.4], we would thus calculate

    $\lambda_1 = \frac{0.6 + \sqrt{(0.6)^2 - 4(0.08)}}{2} = 0.4$
    $\lambda_2 = \frac{0.6 - \sqrt{(0.6)^2 - 4(0.08)}}{2} = 0.2$.




It is instructive to compare these results with those in Chapter 1. There the 
dynamics of the second-order difference equation [2.3.1] were summarized by 
calculating the eigenvalues of the matrix F given by 


    $F = \begin{bmatrix} \phi_1 & \phi_2 \\ 1 & 0 \end{bmatrix}$.    [2.3.17]

The eigenvalues of $F$ were seen to be the two values of $\lambda$ that satisfy equation
[1.2.13]:

    $(\lambda^2 - \phi_1 \lambda - \phi_2) = 0$.

But this is the same calculation as in [2.3.14]. This finding is summarized in the
following proposition.


Proposition 2.1: Factoring the polynomial $(1 - \phi_1 L - \phi_2 L^2)$ as

    $(1 - \phi_1 L - \phi_2 L^2) = (1 - \lambda_1 L)(1 - \lambda_2 L)$    [2.3.18]

is the same calculation as finding the eigenvalues of the matrix $F$ in [2.3.17]. The
eigenvalues $\lambda_1$ and $\lambda_2$ of $F$ are the same as the parameters $\lambda_1$ and $\lambda_2$ in [2.3.18], and
are given by equations [2.3.15] and [2.3.16].


The correspondence between calculating the eigenvalues of a matrix and 
factoring a polynomial in the lag operator is very instructive. However, it introduces 
one minor source of possible semantic confusion about which we have to be careful. 
Recall from Chapter 1 that the system [2.3.1] is stable if both $\lambda_1$ and $\lambda_2$ are less
than 1 in modulus and explosive if either $\lambda_1$ or $\lambda_2$ is greater than 1 in modulus.
Sometimes this is described as the requirement that the roots of

    $(\lambda^2 - \phi_1 \lambda - \phi_2) = 0$    [2.3.19]

lie inside the unit circle. The possible confusion is that it is often convenient to
work directly with the polynomial in the form in which it appears in [2.3.2],

    $(1 - \phi_1 z - \phi_2 z^2) = 0$,    [2.3.20]

whose roots, we have seen, are the reciprocals of those of [2.3.19]. Thus, we could
say with equal accuracy that “the difference equation [2.3.1] is stable whenever
the roots of [2.3.19] lie inside the unit circle” or that “the difference equation
[2.3.1] is stable whenever the roots of [2.3.20] lie outside the unit circle.” The two
statements mean exactly the same thing. Some scholars refer simply to the “roots
of the difference equation [2.3.1],” though this raises the possibility of confusion
between [2.3.19] and [2.3.20]. This book will follow the convention of using the
term “eigenvalues” to refer to the roots of [2.3.19]. Wherever the term “roots” is
used, we will indicate explicitly the equation whose roots are being described.

From here on in this section, it is assumed that the second-order difference
equation is stable, with the eigenvalues $\lambda_1$ and $\lambda_2$ distinct and both inside the unit
circle. Where this is the case, the inverses

    $(1 - \lambda_1 L)^{-1} = 1 + \lambda_1 L + \lambda_1^2 L^2 + \lambda_1^3 L^3 + \cdots$
    $(1 - \lambda_2 L)^{-1} = 1 + \lambda_2 L + \lambda_2^2 L^2 + \lambda_2^3 L^3 + \cdots$

are well defined for bounded sequences. Write [2.3.2] in factored form:

    $(1 - \lambda_1 L)(1 - \lambda_2 L) y_t = w_t$

and operate on both sides by $(1 - \lambda_1 L)^{-1}(1 - \lambda_2 L)^{-1}$:

    $y_t = (1 - \lambda_1 L)^{-1}(1 - \lambda_2 L)^{-1} w_t$.    [2.3.21]




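Following Sargent (1987, p. 184), when $\lambda_1 \neq \lambda_2$, we can use the following operator:

    $(\lambda_1 - \lambda_2)^{-1} \left\{ \frac{\lambda_1}{1 - \lambda_1 L} - \frac{\lambda_2}{1 - \lambda_2 L} \right\}$.    [2.3.22]

Notice that this is simply another way of writing the operator in [2.3.21]:

    $(\lambda_1 - \lambda_2)^{-1} \left\{ \frac{\lambda_1}{1 - \lambda_1 L} - \frac{\lambda_2}{1 - \lambda_2 L} \right\}$
    $\qquad = (\lambda_1 - \lambda_2)^{-1} \left\{ \frac{\lambda_1 (1 - \lambda_2 L) - \lambda_2 (1 - \lambda_1 L)}{(1 - \lambda_1 L)(1 - \lambda_2 L)} \right\}$
    $\qquad = \frac{1}{(1 - \lambda_1 L)(1 - \lambda_2 L)}$.

Thus, [2.3.21] can be written as

    $y_t = (\lambda_1 - \lambda_2)^{-1} \left\{ \frac{\lambda_1}{1 - \lambda_1 L} - \frac{\lambda_2}{1 - \lambda_2 L} \right\} w_t$
    $\quad = \left\{ \frac{\lambda_1}{\lambda_1 - \lambda_2} [1 + \lambda_1 L + \lambda_1^2 L^2 + \lambda_1^3 L^3 + \cdots] - \frac{\lambda_2}{\lambda_1 - \lambda_2} [1 + \lambda_2 L + \lambda_2^2 L^2 + \lambda_2^3 L^3 + \cdots] \right\} w_t$,
or
    $y_t = [c_1 + c_2] w_t + [c_1 \lambda_1 + c_2 \lambda_2] w_{t-1} + [c_1 \lambda_1^2 + c_2 \lambda_2^2] w_{t-2} + [c_1 \lambda_1^3 + c_2 \lambda_2^3] w_{t-3} + \cdots$,    [2.3.23]
where
    $c_1 = \lambda_1 / (\lambda_1 - \lambda_2)$    [2.3.24]
    $c_2 = -\lambda_2 / (\lambda_1 - \lambda_2)$.    [2.3.25]

From [2.3.23] the dynamic multiplier can be read off directly as

    $\frac{\partial y_{t+j}}{\partial w_t} = c_1 \lambda_1^j + c_2 \lambda_2^j$,

the same result arrived at in equations [1.2.24] and [1.2.25].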
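As a rough check on [2.3.23] through [2.3.25], the closed-form dynamic multiplier can be compared with the multiplier obtained by feeding a unit impulse through the difference equation [2.3.1] directly. The sketch below is illustrative only: the function names are hypothetical, and the recursion with starting values 1 and $\phi_1$ comes from iterating [2.3.1] on a one-time unit change in $w_t$, not from a formula stated in this section.

```python
def multipliers_closed_form(lam1, lam2, horizon):
    """c1 * lam1**j + c2 * lam2**j for j = 0..horizon, using [2.3.24]-[2.3.25].

    Requires lam1 != lam2.
    """
    c1 = lam1 / (lam1 - lam2)
    c2 = -lam2 / (lam1 - lam2)
    return [c1 * lam1**j + c2 * lam2**j for j in range(horizon + 1)]


def multipliers_recursive(phi1, phi2, horizon):
    """Response of y_{t+j} to a unit impulse in w_t, iterating [2.3.1]."""
    response = [1.0, phi1]
    for j in range(2, horizon + 1):
        response.append(phi1 * response[j - 1] + phi2 * response[j - 2])
    return response[:horizon + 1]


# Example [2.3.4]: phi1 = 0.6, phi2 = -0.08, so lam1 = 0.4 and lam2 = 0.2.
closed = multipliers_closed_form(0.4, 0.2, horizon=10)
recursive = multipliers_recursive(0.6, -0.08, horizon=10)
largest_gap = max(abs(a - b) for a, b in zip(closed, recursive))
print(closed[:4])
print(largest_gap)
```

The two lists agree up to floating-point rounding, which is what Proposition 2.1 leads one to expect.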


2.4. pth-Order Difference Equations 


These techniques generalize in a straightforward way to a pth-order difference 
equation of the form 


    $y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + w_t$.    [2.4.1]

Write [2.4.1] in terms of lag operators as

    $(1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p) y_t = w_t$.    [2.4.2]

Factor the operator on the left side of [2.4.2] as

    $(1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p) = (1 - \lambda_1 L)(1 - \lambda_2 L) \cdots (1 - \lambda_p L)$.    [2.4.3]

This is the same as finding the values of $(\lambda_1, \lambda_2, \ldots, \lambda_p)$ such that the following
polynomials are the same for all $z$:

    $(1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_p z^p) = (1 - \lambda_1 z)(1 - \lambda_2 z) \cdots (1 - \lambda_p z)$.

As in the second-order system, we multiply both sides of this equation by $z^{-p}$ and
define $\lambda \equiv z^{-1}$:

    $(\lambda^p - \phi_1 \lambda^{p-1} - \phi_2 \lambda^{p-2} - \cdots - \phi_{p-1} \lambda - \phi_p) = (\lambda - \lambda_1)(\lambda - \lambda_2) \cdots (\lambda - \lambda_p)$.    [2.4.4]

Clearly, setting $\lambda = \lambda_i$ for $i = 1, 2, \ldots,$ or $p$ causes the right side of [2.4.4] to
equal zero. Thus the values $(\lambda_1, \lambda_2, \ldots, \lambda_p)$ must be the numbers that set the left
side of expression [2.4.4] to zero as well:

    $\lambda^p - \phi_1 \lambda^{p-1} - \phi_2 \lambda^{p-2} - \cdots - \phi_{p-1} \lambda - \phi_p = 0$.    [2.4.5]

This expression again is identical to that given in Proposition 1.1, which
characterized the eigenvalues $(\lambda_1, \lambda_2, \ldots, \lambda_p)$ of the matrix $F$ defined in equation [1.2.3].
Thus, Proposition 2.1 readily generalizes.


Proposition 2.2: Factoring a pth-order polynomial in the lag operator,

    $(1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p) = (1 - \lambda_1 L)(1 - \lambda_2 L) \cdots (1 - \lambda_p L)$,

is the same calculation as finding the eigenvalues of the matrix $F$ defined in [1.2.3].
The eigenvalues $(\lambda_1, \lambda_2, \ldots, \lambda_p)$ of $F$ are the same as the parameters $(\lambda_1, \lambda_2, \ldots,
\lambda_p)$ in [2.4.3] and are given by the solutions to equation [2.4.5].


The difference equation [2.4.1] is stable if the eigenvalues (the roots of [2.4.5]) 
lie inside the unit circle, or equivalently if the roots of 


    $1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_p z^p = 0$    [2.4.6]
lie outside the unit circle. 
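The two equivalent statements of the stability condition can be illustrated by starting from a set of eigenvalues, expanding the product in [2.4.3] to recover the coefficients, and then evaluating both polynomials. This is a sketch under stated assumptions: the eigenvalues in `lams` are arbitrary illustrative numbers, and the helper names (`phi_from_lambdas`, `char_poly`, `lag_poly`) are hypothetical.

```python
def phi_from_lambdas(lams):
    """Expand (1 - lam_1 z)...(1 - lam_p z) and return [phi_1, ..., phi_p]."""
    coeffs = [1.0]  # coefficients of z**0, z**1, ... of the running product
    for lam in lams:
        shifted = [0.0] + [-lam * c for c in coeffs]   # (-lam * z) * product
        coeffs = [a + b for a, b in zip(coeffs + [0.0], shifted)]
    # The product equals 1 - phi_1 z - ... - phi_p z**p, so flip the signs.
    return [-c for c in coeffs[1:]]


def char_poly(phi, lam):
    """Left side of [2.4.5]: lam**p - phi_1 lam**(p-1) - ... - phi_p."""
    p = len(phi)
    return lam**p - sum(c * lam**(p - 1 - i) for i, c in enumerate(phi))


def lag_poly(phi, z):
    """Left side of [2.4.6]: 1 - phi_1 z - ... - phi_p z**p."""
    return 1 - sum(c * z**(i + 1) for i, c in enumerate(phi))


lams = [0.5, -0.3, 0.9]            # illustrative eigenvalues, all inside the unit circle
phi = phi_from_lambdas(lams)

eigenvalue_residuals = [abs(char_poly(phi, lam)) for lam in lams]
root_residuals = [abs(lag_poly(phi, 1 / lam)) for lam in lams]
stable = all(abs(lam) < 1 for lam in lams)
roots_outside = all(abs(1 / lam) > 1 for lam in lams)
print(phi, stable, roots_outside)
```

Each $\lambda_i$ sets [2.4.5] to zero and each $1/\lambda_i$ sets [2.4.6] to zero, so "eigenvalues inside" and "roots outside" describe the same parameter values.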

Assuming that the eigenvalues are inside the unit circle and that we are
restricting ourselves to considering bounded sequences, the inverses $(1 - \lambda_1 L)^{-1}$,
$(1 - \lambda_2 L)^{-1}, \ldots, (1 - \lambda_p L)^{-1}$ all exist, permitting the difference equation

    $(1 - \lambda_1 L)(1 - \lambda_2 L) \cdots (1 - \lambda_p L) y_t = w_t$

to be written as

    $y_t = (1 - \lambda_1 L)^{-1}(1 - \lambda_2 L)^{-1} \cdots (1 - \lambda_p L)^{-1} w_t$.    [2.4.7]

Provided further that the eigenvalues $(\lambda_1, \lambda_2, \ldots, \lambda_p)$ are all distinct, the
polynomial associated with the operator on the right side of [2.4.7] can again be
expanded with partial fractions:

    $\frac{1}{(1 - \lambda_1 z)(1 - \lambda_2 z) \cdots (1 - \lambda_p z)} = \frac{c_1}{(1 - \lambda_1 z)} + \frac{c_2}{(1 - \lambda_2 z)} + \cdots + \frac{c_p}{(1 - \lambda_p z)}$.    [2.4.8]

Following Sargent (1987, pp. 192-93), the values of $(c_1, c_2, \ldots, c_p)$ that make [2.4.8]
true can be found by multiplying both sides by $(1 - \lambda_1 z)(1 - \lambda_2 z) \cdots (1 - \lambda_p z)$:

    $1 = c_1 (1 - \lambda_2 z)(1 - \lambda_3 z) \cdots (1 - \lambda_p z)$
    $\quad + c_2 (1 - \lambda_1 z)(1 - \lambda_3 z) \cdots (1 - \lambda_p z) + \cdots$    [2.4.9]
    $\quad + c_p (1 - \lambda_1 z)(1 - \lambda_2 z) \cdots (1 - \lambda_{p-1} z)$.


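Equation [2.4.9] has to hold for all values of $z$. Since it is a $(p - 1)$th-order
polynomial, if $(c_1, c_2, \ldots, c_p)$ are chosen so that [2.4.9] holds for $p$ particular
distinct values of $z$, then [2.4.9] must hold for all $z$. To ensure that [2.4.9] holds
at $z = \lambda_1^{-1}$ requires that

    $1 = c_1 (1 - \lambda_2 \lambda_1^{-1})(1 - \lambda_3 \lambda_1^{-1}) \cdots (1 - \lambda_p \lambda_1^{-1})$
or
    $c_1 = \frac{\lambda_1^{p-1}}{(\lambda_1 - \lambda_2)(\lambda_1 - \lambda_3) \cdots (\lambda_1 - \lambda_p)}$.    [2.4.10]

For [2.4.9] to hold for $z = \lambda_2^{-1}, \lambda_3^{-1}, \ldots, \lambda_p^{-1}$ requires

    $c_2 = \frac{\lambda_2^{p-1}}{(\lambda_2 - \lambda_1)(\lambda_2 - \lambda_3) \cdots (\lambda_2 - \lambda_p)}$    [2.4.11]
    $\vdots$
    $c_p = \frac{\lambda_p^{p-1}}{(\lambda_p - \lambda_1)(\lambda_p - \lambda_2) \cdots (\lambda_p - \lambda_{p-1})}$.    [2.4.12]

Note again that these are identical to expression [1.2.25] in Chapter 1. Recall from
the discussion there that $c_1 + c_2 + \cdots + c_p = 1$.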
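To conclude, [2.4.7] can be written

    $y_t = \frac{c_1}{(1 - \lambda_1 L)} w_t + \frac{c_2}{(1 - \lambda_2 L)} w_t + \cdots + \frac{c_p}{(1 - \lambda_p L)} w_t$
    $\quad = c_1 (1 + \lambda_1 L + \lambda_1^2 L^2 + \lambda_1^3 L^3 + \cdots) w_t + c_2 (1 + \lambda_2 L + \lambda_2^2 L^2 + \lambda_2^3 L^3 + \cdots) w_t$
    $\qquad + \cdots + c_p (1 + \lambda_p L + \lambda_p^2 L^2 + \lambda_p^3 L^3 + \cdots) w_t$
or
    $y_t = [c_1 + c_2 + \cdots + c_p] w_t + [c_1 \lambda_1 + c_2 \lambda_2 + \cdots + c_p \lambda_p] w_{t-1}$
    $\qquad + [c_1 \lambda_1^2 + c_2 \lambda_2^2 + \cdots + c_p \lambda_p^2] w_{t-2}$    [2.4.13]
    $\qquad + [c_1 \lambda_1^3 + c_2 \lambda_2^3 + \cdots + c_p \lambda_p^3] w_{t-3} + \cdots$,

where $(c_1, c_2, \ldots, c_p)$ are given by equations [2.4.10] through [2.4.12]. Again,
the dynamic multiplier can be read directly off [2.4.13]:

    $\frac{\partial y_{t+j}}{\partial w_t} = [c_1 \lambda_1^j + c_2 \lambda_2^j + \cdots + c_p \lambda_p^j]$,    [2.4.14]

reproducing the result from Chapter 1.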
There is a very convenient way to calculate the effect of $w$ on the present
value of $y$ using the lag operator representation. Write [2.4.13] as

    $y_t = \psi_0 w_t + \psi_1 w_{t-1} + \psi_2 w_{t-2} + \psi_3 w_{t-3} + \cdots$    [2.4.15]
where
    $\psi_j \equiv [c_1 \lambda_1^j + c_2 \lambda_2^j + \cdots + c_p \lambda_p^j]$.    [2.4.16]

Next rewrite [2.4.15] in lag operator notation as

    $y_t = \psi(L) w_t$,    [2.4.17]

where $\psi(L)$ denotes an infinite-order polynomial in the lag operator:

    $\psi(L) = \psi_0 + \psi_1 L + \psi_2 L^2 + \psi_3 L^3 + \cdots$.




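Notice that $\psi_j$ is the dynamic multiplier [2.4.14]. The effect of $w_t$ on the present
value of $y$ is given by

    $\frac{\partial \sum_{j=0}^{\infty} \beta^j y_{t+j}}{\partial w_t} = \sum_{j=0}^{\infty} \beta^j \frac{\partial y_{t+j}}{\partial w_t} = \sum_{j=0}^{\infty} \beta^j \psi_j$.    [2.4.18]

Thinking of $\psi(z)$ as a polynomial in a real number $z$,

    $\psi(z) = \psi_0 + \psi_1 z + \psi_2 z^2 + \psi_3 z^3 + \cdots$,

it appears that the multiplier [2.4.18] is simply this polynomial evaluated at $z = \beta$:

    $\frac{\partial \sum_{j=0}^{\infty} \beta^j y_{t+j}}{\partial w_t} = \psi(\beta) = \psi_0 + \psi_1 \beta + \psi_2 \beta^2 + \psi_3 \beta^3 + \cdots$.    [2.4.19]

But comparing [2.4.17] with [2.4.7], it is apparent that

    $\psi(L) = [(1 - \lambda_1 L)(1 - \lambda_2 L) \cdots (1 - \lambda_p L)]^{-1}$,

and from [2.4.3] this means that

    $\psi(L) = [1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p]^{-1}$.

We conclude that

    $\psi(z) = [1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_p z^p]^{-1}$

for any value of $z$, so, in particular,

    $\psi(\beta) = [1 - \phi_1 \beta - \phi_2 \beta^2 - \cdots - \phi_p \beta^p]^{-1}$.    [2.4.20]

Substituting [2.4.20] into [2.4.19] reveals that

    $\frac{\partial \sum_{j=0}^{\infty} \beta^j y_{t+j}}{\partial w_t} = \frac{1}{1 - \phi_1 \beta - \phi_2 \beta^2 - \cdots - \phi_p \beta^p}$,    [2.4.21]

reproducing the claim in Proposition 1.3. Again, the long-run multiplier obtains
as the special case of [2.4.21] with $\beta = 1$:

    $\lim_{j \to \infty} \left[ \frac{\partial y_{t+j}}{\partial w_t} + \frac{\partial y_{t+j}}{\partial w_{t+1}} + \frac{\partial y_{t+j}}{\partial w_{t+2}} + \cdots + \frac{\partial y_{t+j}}{\partial w_{t+j}} \right] = \frac{1}{1 - \phi_1 - \phi_2 - \cdots - \phi_p}$.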
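The partial-fraction weights [2.4.10] through [2.4.12] and the present-value result [2.4.21] can be checked against each other numerically. The following is a minimal sketch, not taken from the text: the function names are hypothetical, the discount factor `beta = 0.95` is an arbitrary illustrative choice, the infinite sum is truncated at 200 terms, and the eigenvalues are those of the earlier example [2.3.4].

```python
from math import prod


def partial_fraction_weights(lams):
    """c_i = lam_i**(p-1) / prod_{k != i} (lam_i - lam_k), as in [2.4.10]-[2.4.12].

    Requires the eigenvalues in `lams` to be distinct.
    """
    p = len(lams)
    return [lam**(p - 1) / prod(lam - other
                                for k, other in enumerate(lams) if k != i)
            for i, lam in enumerate(lams)]


def psi(lams, j):
    """Dynamic multiplier psi_j from [2.4.16]."""
    weights = partial_fraction_weights(lams)
    return sum(c * lam**j for c, lam in zip(weights, lams))


lams = [0.4, 0.2]        # eigenvalues from example [2.3.4]
phi = [0.6, -0.08]       # the corresponding phi_1, phi_2
beta = 0.95              # illustrative discount factor (assumption)

weights = partial_fraction_weights(lams)

# Truncated version of the sum in [2.4.18] versus the closed form [2.4.21].
pv_sum = sum(beta**j * psi(lams, j) for j in range(200))
pv_closed = 1 / (1 - sum(c * beta**(i + 1) for i, c in enumerate(phi)))

# Long-run multiplier: the special case beta = 1.
long_run_sum = sum(psi(lams, j) for j in range(200))
long_run_closed = 1 / (1 - sum(phi))

print(weights, pv_sum, pv_closed, long_run_sum, long_run_closed)
```

The weights sum to one, as noted after [2.4.12], and the truncated sums agree with the closed forms to within rounding because the omitted terms decay geometrically.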


2.5. Initial Conditions and Unbounded Sequences 
Section 1.2 analyzed the following problem. Given a pth-order difference equation 


    $y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + w_t$,    [2.5.1]

$p$ initial values of $y$,

    $y_{-1}, y_{-2}, \ldots, y_{-p}$,    [2.5.2]

and a sequence of values for the input variable $w$,

    $\{w_0, w_1, \ldots, w_t\}$,    [2.5.3]

we sought to calculate the sequence of values for the output variable $y$:

    $\{y_0, y_1, \ldots, y_t\}$.

Certainly there are systems where the question is posed in precisely this form. We
may know the equation of motion for the system [2.5.1] and its current state [2.5.2]
and wish to characterize the values that $\{y_0, y_1, \ldots, y_t\}$ might take on for different
specifications of $\{w_0, w_1, \ldots, w_t\}$.

However, there are many examples in economics and finance in which a 
theory specifies just the equation of motion [2.5.1] and a sequence of driving 
variables [2.5.3]. Clearly, these two pieces of information alone are insufficient to 
determine the sequence $\{y_0, y_1, \ldots, y_t\}$, and some additional theory beyond that
contained in the difference equation [2.5.1] is needed to describe fully the de- 
pendence of y on w. These additional restrictions can be of interest in their own 
right and also help give some insight into some of the technical details of manip- 
ulating difference equations. For these reasons, this section discusses in some depth 
an example of the role of initial conditions and their implications for solving dif- 
ference equations. 

Let $P_t$ denote the price of a stock and $D_t$ its dividend payment. If an investor
buys the stock at date $t$ and sells it at $t + 1$, the investor will earn a yield of
$D_t/P_t$ from the dividend and a yield of $(P_{t+1} - P_t)/P_t$ in capital gains. The investor's
total return $(r_{t+1})$ is thus

    $r_{t+1} = (P_{t+1} - P_t)/P_t + D_t/P_t$.

A very simple model of the stock market posits that the return investors earn on
stocks is constant across time periods:

    $r = (P_{t+1} - P_t)/P_t + D_t/P_t, \qquad r > 0$.    [2.5.4]


Equation [2.5.4] may seem too simplistic to be of much practical interest; it 
assumes among other things that investors have perfect foresight about future stock 
prices and dividends. However, a slightly more realistic model in which expected 
stock returns are constant involves a very similar set of technical issues. The ad- 
vantage of the perfect-foresight model [2.5.4] is that it can be discussed using the 
tools already in hand to gain some further insight into using lag operators to solve 
difference equations. 

Multiply [2.5.4] by $P_t$ to arrive at

    $r P_t = P_{t+1} - P_t + D_t$
or
    $P_{t+1} = (1 + r) P_t - D_t$.    [2.5.5]

Equation [2.5.5] will be recognized as a first-order difference equation of the form
of [1.1.1] with $y_t = P_{t+1}$, $\phi = (1 + r)$, and $w_t = -D_t$. From [1.1.7], we know
that [2.5.5] implies that

    $P_{t+1} = (1 + r)^{t+1} P_0 - (1 + r)^t D_0 - (1 + r)^{t-1} D_1 - (1 + r)^{t-2} D_2$    [2.5.6]
    $\qquad - \cdots - (1 + r) D_{t-1} - D_t$.

If the sequence $\{D_0, D_1, \ldots, D_t\}$ and the value of $P_0$ were given, then [2.5.6]
could determine the values of $\{P_1, P_2, \ldots, P_{t+1}\}$. But if only the values $\{D_0, D_1,
\ldots, D_t\}$ are given, then equation [2.5.6] would not be enough to pin down $\{P_1,
P_2, \ldots, P_{t+1}\}$. There are an infinite number of possible sequences $\{P_1, P_2, \ldots,
P_{t+1}\}$ consistent with [2.5.5] and with a given $\{D_0, D_1, \ldots, D_t\}$. This infinite
number of possibilities is indexed by the initial value $P_0$.




A further simplifying assumption helps clarify the nature of these different
paths for $\{P_1, P_2, \ldots, P_{t+1}\}$. Suppose that dividends are constant over time:

    $D_t = D$ for all $t$.

Then [2.5.6] becomes

    $P_{t+1} = (1 + r)^{t+1} P_0 - [(1 + r)^t + (1 + r)^{t-1} + \cdots + (1 + r) + 1] D$
    $\qquad = (1 + r)^{t+1} P_0 - \frac{1 - (1 + r)^{t+1}}{1 - (1 + r)} D$    [2.5.7]
    $\qquad = (1 + r)^{t+1} [P_0 - (D/r)] + (D/r)$.

Consider first the solution in which $P_0 = D/r$. If the initial stock price should
happen to take this value, then [2.5.7] implies that

    $P_t = D/r$    [2.5.8]


for all t. In this solution, dividends are constant at D and the stock price is constant 
at D/r. With no change in stock prices, investors never have any capital gains or 
losses, and their return is solely the dividend yield D/P = r. In a world with no 
changes in dividends this seems to be a sensible expression of the theory represented 
by [2.5.4]. Equation [2.5.8] is sometimes described as the “market fundamentals” 
solution to [2.5.4] for the case of constant dividends. 

However, even with constant dividends, equation [2.5.8] is not the only result
consistent with [2.5.4]. Suppose that the initial price exceeded $D/r$:

    $P_0 > D/r$.

Investors seem to be valuing the stock beyond the potential of its constant dividend
stream. From [2.5.7] this could be consistent with the asset pricing theory [2.5.4]
provided that $P_1$ exceeds $D/r$ by an even larger amount. As long as investors all
believe that prices will continue to rise over time, each will earn the required return
$r$ from the realized capital gain and [2.5.4] will be satisfied. This scenario has
reminded many economists of a speculative bubble in stock prices.

If such bubbles are to be ruled out, additional knowledge about the process
for $\{P_t\}_{t=-\infty}^{\infty}$ is required beyond that contained in the theory of [2.5.4]. For example,
we might argue that finite world resources put an upper limit on feasible stock
prices, as in

    $|P_t| < \bar{P}$ for all $t$.    [2.5.9]

Then the only sequence for $\{P_t\}_{t=-\infty}^{\infty}$ consistent with both [2.5.4] and [2.5.9] would
be the market fundamentals solution [2.5.8].
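The contrast between the market fundamentals path and a bubble path can be seen by iterating [2.5.5] forward from two different initial prices. This is an illustrative sketch only: the return `r = 0.05`, the dividend `D = 1.0`, the size of the initial overvaluation, and the function name `price_path` are all assumptions chosen for the example.

```python
def price_path(p0, dividend, r, periods):
    """Iterate P_{t+1} = (1 + r) * P_t - D, equation [2.5.5], with constant D."""
    prices = [p0]
    for _ in range(periods):
        prices.append((1 + r) * prices[-1] - dividend)
    return prices


r, D = 0.05, 1.0                                   # illustrative values
fundamental = price_path(D / r, D, r, periods=50)  # starts exactly at D/r
bubble = price_path(D / r + 1.0, D, r, periods=50) # starts one unit above D/r

# Closed form [2.5.7] for the bubble path at date 50.
closed_form_50 = (1 + r)**50 * ((D / r + 1.0) - D / r) + D / r

print(fundamental[0], fundamental[-1])
print(bubble[0], bubble[-1], closed_form_50)
```

The first path stays at $D/r$, while the gap between the second path and $D/r$ grows by the factor $(1 + r)$ every period, so any fixed bound of the kind in [2.5.9] is eventually violated.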

Let us now relax the assumption that dividends are constant and replace it
with the assumption that $\{D_t\}_{t=-\infty}^{\infty}$ is a bounded sequence. What path for
$\{P_t\}_{t=-\infty}^{\infty}$ in [2.5.6] is consistent with [2.5.9] in this case? The answer can be found
by returning to the difference equation [2.5.5]. We arrived at the form [2.5.6] by
recursively substituting this equation backward. That is, we used the fact that [2.5.5]
held for dates $t, t - 1, t - 2, \ldots, 0$ and recursively substituted to arrive at [2.5.6]
as a logical implication of [2.5.5]. Equation [2.5.5] could equally well be solved
recursively forward. To do so, equation [2.5.5] is written as

    $P_t = \frac{1}{1 + r} [P_{t+1} + D_t]$.    [2.5.10]




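An analogous equation must hold for date $t + 1$:

    $P_{t+1} = \frac{1}{1 + r} [P_{t+2} + D_{t+1}]$.    [2.5.11]

Substitute [2.5.11] into [2.5.10] to deduce

    $P_t = \frac{1}{1 + r} \left[ \frac{1}{1 + r} [P_{t+2} + D_{t+1}] + D_t \right]$
    $\quad = \left[\frac{1}{1 + r}\right]^2 P_{t+2} + \left[\frac{1}{1 + r}\right]^2 D_{t+1} + \left[\frac{1}{1 + r}\right] D_t$.    [2.5.12]

Using [2.5.10] for date $t + 2$,

    $P_{t+2} = \frac{1}{1 + r} [P_{t+3} + D_{t+2}]$,

and substituting into [2.5.12] gives

    $P_t = \left[\frac{1}{1 + r}\right]^3 P_{t+3} + \left[\frac{1}{1 + r}\right]^3 D_{t+2} + \left[\frac{1}{1 + r}\right]^2 D_{t+1} + \left[\frac{1}{1 + r}\right] D_t$.

Continuing in this fashion $T$ periods into the future produces

    $P_t = \left[\frac{1}{1 + r}\right]^T P_{t+T} + \left[\frac{1}{1 + r}\right]^T D_{t+T-1} + \left[\frac{1}{1 + r}\right]^{T-1} D_{t+T-2}$    [2.5.13]
    $\qquad + \cdots + \left[\frac{1}{1 + r}\right]^2 D_{t+1} + \left[\frac{1}{1 + r}\right] D_t$.

If the sequence $\{P_t\}_{t=-\infty}^{\infty}$ is to satisfy [2.5.9], then

    $\lim_{T \to \infty} \left[\frac{1}{1 + r}\right]^T P_{t+T} = 0$.

If $\{D_t\}_{t=-\infty}^{\infty}$ is likewise a bounded sequence, then the following limit exists:

    $\lim_{T \to \infty} \sum_{j=0}^{T} \left[\frac{1}{1 + r}\right]^{j+1} D_{t+j}$.

Thus, if $\{P_t\}_{t=-\infty}^{\infty}$ is to be a bounded sequence, then we can take the limit of [2.5.13]
as $T \to \infty$ to conclude

    $P_t = \sum_{j=0}^{\infty} \left[\frac{1}{1 + r}\right]^{j+1} D_{t+j}$,    [2.5.14]

which is referred to as the “market fundamentals” solution of [2.5.5] for the general
case of time-varying dividends. Notice that [2.5.14] produces [2.5.8] as a special
case when $D_t = D$ for all $t$.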
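The forward solution [2.5.14] can be evaluated numerically for a bounded dividend path and checked against the original difference equation [2.5.5]. The sketch below is an illustration under stated assumptions: the sinusoidal dividend function, the return `r = 0.05`, the truncation of the infinite sum at 2000 terms, and all function names are hypothetical choices for the example.

```python
import math


def sine_dividend(t):
    """A bounded, time-varying dividend path (illustrative assumption)."""
    return 1.0 + 0.5 * math.sin(t)


def constant_dividend(t):
    """Constant dividends D = 1, the special case behind [2.5.8]."""
    return 1.0


def fundamental_price(dividend, t, r, horizon=2000):
    """Truncated version of [2.5.14]: sum of D_{t+j} / (1 + r)**(j + 1)."""
    return sum(dividend(t + j) / (1 + r)**(j + 1) for j in range(horizon))


r = 0.05

# Check that the forward solution satisfies P_{t+1} = (1 + r) P_t - D_t.
gaps = [fundamental_price(sine_dividend, t + 1, r)
        - ((1 + r) * fundamental_price(sine_dividend, t, r) - sine_dividend(t))
        for t in range(5)]

# With constant dividends the sum collapses to D / r.
constant_case = fundamental_price(constant_dividend, 0, r)

print(max(abs(g) for g in gaps))
print(constant_case)
```

Because the dividends are bounded and $1/(1 + r) < 1$, the discarded tail of the sum is negligible at this horizon, which is the numerical counterpart of the limit argument above.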

Describing the value of a variable at time $t$ as a function of future realizations
of another variable as in [2.5.14] may seem an artifact of assuming a perfect-foresight
model of stock prices. However, an analogous set of operations turns out
to be appropriate in a system similar to [2.5.4] in which expected returns are
constant.¹ In such systems [2.5.14] generalizes to

    $P_t = \sum_{j=0}^{\infty} \left[\frac{1}{1 + r}\right]^{j+1} E_t D_{t+j}$,

where $E_t$ denotes an expectation of an unknown future quantity based on
information available to investors at date $t$.

¹See Sargent (1987) and Whiteman (1983) for an introduction to the manipulation of difference
equations involving expectations.

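Expression [2.5.14] determines the particular value for the initial price $P_0$ that
is consistent with the boundedness condition [2.5.9]. Setting $t = 0$ in [2.5.14] and
substituting into [2.5.6] produces

    $P_{t+1} = (1 + r)^{t+1} \left\{ \left[\frac{1}{1 + r}\right] D_0 + \left[\frac{1}{1 + r}\right]^2 D_1 + \left[\frac{1}{1 + r}\right]^3 D_2 + \cdots + \left[\frac{1}{1 + r}\right]^{t+1} D_t + \left[\frac{1}{1 + r}\right]^{t+2} D_{t+1} + \cdots \right\}$
    $\qquad - (1 + r)^t D_0 - (1 + r)^{t-1} D_1 - (1 + r)^{t-2} D_2 - \cdots - (1 + r) D_{t-1} - D_t$
    $\quad = \left[\frac{1}{1 + r}\right] D_{t+1} + \left[\frac{1}{1 + r}\right]^2 D_{t+2} + \left[\frac{1}{1 + r}\right]^3 D_{t+3} + \cdots$.

Thus, setting the initial condition $P_0$ to satisfy [2.5.14] is sufficient to ensure that
it holds for all $t$. Choosing $P_0$ equal to any other value would cause the consequences
of each period's dividends to accumulate over time so as to lead to a violation of
[2.5.9] eventually.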

It is useful to discuss these same calculations from the perspective of lag
operators. In Section 2.2 the recursive substitution backward that led from [2.5.5]
to [2.5.6] was represented by writing [2.5.5] in terms of lag operators as

    $[1 - (1 + r) L] P_{t+1} = -D_t$    [2.5.15]

and multiplying both sides of [2.5.15] by the following operator:

    $[1 + (1 + r) L + (1 + r)^2 L^2 + \cdots + (1 + r)^t L^t]$.    [2.5.16]

If $(1 + r)$ were less than unity, it would be natural to consider the limit of [2.5.16]
as $t \to \infty$:

    $[1 - (1 + r) L]^{-1} = 1 + (1 + r) L + (1 + r)^2 L^2 + \cdots$.

In the case of the theory of stock returns discussed here, however, $r > 0$ and
this operator is not defined. In this case, a lag operator representation can be
sought for the recursive substitution forward that led from [2.5.5] to [2.5.13]. This
is accomplished using the inverse of the lag operator,

    $L^{-1} w_t = w_{t+1}$,

which extends result [2.1.4] to negative values of $k$. Note that $L^{-1}$ is indeed the
inverse of the operator $L$:

    $L^{-1}(L w_t) = L^{-1} w_{t-1} = w_t$.

In general,

    $L^{-k} L^j = L^{j-k}$,

with $L^0$ defined as the identity operator:

    $L^0 w_t = w_t$.




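Now consider multiplying [2.5.15] by

    $[1 + (1 + r)^{-1} L^{-1} + (1 + r)^{-2} L^{-2} + \cdots + (1 + r)^{-(T-1)} L^{-(T-1)}]$    [2.5.17]
    $\qquad \times [-(1 + r)^{-1} L^{-1}]$

to obtain

    $[1 + (1 + r)^{-1} L^{-1} + (1 + r)^{-2} L^{-2} + \cdots + (1 + r)^{-(T-1)} L^{-(T-1)}]$
    $\qquad \times [1 - (1 + r)^{-1} L^{-1}] P_{t+1}$
    $\quad = [1 + (1 + r)^{-1} L^{-1} + (1 + r)^{-2} L^{-2} + \cdots + (1 + r)^{-(T-1)} L^{-(T-1)}] \times (1 + r)^{-1} D_{t+1}$
or
    $[1 - (1 + r)^{-T} L^{-T}] P_{t+1} = \left[\frac{1}{1 + r}\right] D_{t+1} + \left[\frac{1}{1 + r}\right]^2 D_{t+2}$
    $\qquad + \cdots + \left[\frac{1}{1 + r}\right]^{T-1} D_{t+T-1} + \left[\frac{1}{1 + r}\right]^T D_{t+T}$,

which is identical to [2.5.13] with $t$ in [2.5.13] replaced with $t + 1$.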

When $r > 0$ and $\{P_t\}_{t=-\infty}^{\infty}$ is a bounded sequence, the left side of the preceding
equation will approach $P_{t+1}$ as $T$ becomes large. Thus, when $r > 0$ and $\{P_t\}_{t=-\infty}^{\infty}$
and $\{D_t\}_{t=-\infty}^{\infty}$ are bounded sequences, the limit of the operator in [2.5.17] exists
and could be viewed as the inverse of the operator on the left side of [2.5.15]:

    $[1 - (1 + r) L]^{-1} = -(1 + r)^{-1} L^{-1} \times [1 + (1 + r)^{-1} L^{-1} + (1 + r)^{-2} L^{-2} + \cdots]$.

Applying this limiting operator to [2.5.15] amounts to solving the difference
equation forward as in [2.5.14] and selecting the market fundamentals solution among
the set of possible time paths for $\{P_t\}_{t=-\infty}^{\infty}$ given a particular time path for dividends
$\{D_t\}_{t=-\infty}^{\infty}$.

Thus, given a first-order difference equation of the form

    $(1 - \phi L) y_t = w_t$,    [2.5.18]

Sargent's (1987) advice was to solve the equation “backward” when $|\phi| < 1$ by
multiplying by

    $[1 - \phi L]^{-1} = [1 + \phi L + \phi^2 L^2 + \phi^3 L^3 + \cdots]$    [2.5.19]

and to solve the equation “forward” when $|\phi| > 1$ by multiplying by

    $[1 - \phi L]^{-1} = \frac{-\phi^{-1} L^{-1}}{1 - \phi^{-1} L^{-1}}$    [2.5.20]
    $\qquad = -\phi^{-1} L^{-1} [1 + \phi^{-1} L^{-1} + \phi^{-2} L^{-2} + \phi^{-3} L^{-3} + \cdots]$.

Defining the inverse of $[1 - \phi L]$ in this way amounts to selecting an operator
$[1 - \phi L]^{-1}$ with the properties that

    $[1 - \phi L]^{-1} \times [1 - \phi L] = 1$ (the identity operator)

and that, when it is applied to a bounded sequence $\{w_t\}_{t=-\infty}^{\infty}$,

    $[1 - \phi L]^{-1} w_t$,

the result is another bounded sequence.
The conclusion from this discussion is that in applying an operator such as
$[1 - \phi L]^{-1}$, we are implicitly imposing a boundedness assumption that rules out
phenomena such as the speculative bubbles of equation [2.5.7] a priori. Where
that is our intention, so much the better, though we should not apply the rules
[2.5.19] or [2.5.20] without some reflection on their economic content.
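The backward rule [2.5.19] and the forward rule [2.5.20] can be written as a single routine that picks the expansion according to the size of $\phi$. This is a minimal sketch under stated assumptions: the function name `bounded_solution`, the truncation at 1000 terms, the test input $w_t = \cos(t)$, and the two values of $\phi$ are illustrative and not taken from the text.

```python
import math


def bounded_solution(phi, w, t, terms=1000):
    """Approximate [1 - phi L]^{-1} w_t for a bounded input function w.

    |phi| < 1: backward expansion [2.5.19], sum of phi**j * w(t - j), j >= 0.
    |phi| > 1: forward expansion [2.5.20], minus sum of phi**(-j) * w(t + j), j >= 1.
    The infinite sums are truncated after `terms` terms.
    """
    if abs(phi) < 1:
        return sum(phi**j * w(t - j) for j in range(terms))
    if abs(phi) > 1:
        return -sum(phi**(-j) * w(t + j) for j in range(1, terms + 1))
    raise ValueError("no bounded expansion of this form when |phi| == 1")


def residual(phi, w, t):
    """Violation of (1 - phi L) y_t = w_t, equation [2.5.18], at date t."""
    y_t = bounded_solution(phi, w, t)
    y_prev = bounded_solution(phi, w, t - 1)
    return y_t - phi * y_prev - w(t)


backward_residuals = [abs(residual(0.5, math.cos, t)) for t in range(5)]
forward_residuals = [abs(residual(1.05, math.cos, t)) for t in range(5)]
print(max(backward_residuals), max(forward_residuals))
```

With $\phi = 1 + r$ and $w_t = -D_t$, the forward branch reproduces the market fundamentals solution for $P_{t+1}$, which is one way to see what the choice of operator rules out.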


Chapter 2 References 


Sargent, Thomas J. 1987. Macroeconomic Theory, 2d ed. Boston: Academic Press. 


Whiteman, Charles H. 1983. Linear Rational Expectations Models: A User’s Guide. Min- 
neapolis: University of Minnesota Press. 




3 Stationary ARMA Processes


This chapter introduces univariate ARMA processes, which provide a very useful 
class of models for describing the dynamics of an individual time series. The chapter 
begins with definitions of some of the key concepts used in time series analysis. 
Sections 3.2 through 3.5 then investigate the properties of various ARMA processes.
Section 3.6 introduces the autocovariance-generating function, which is useful for 
analyzing the consequences of combining different time series and for an under- 
standing of the population spectrum. The chapter concludes with a discussion of 
invertibility (Section 3.7), which can be important for selecting the ARMA rep- 
resentation of an observed time series that is appropriate given the uses to be made 
of the model. 


3.1. Expectations, Stationarity, and Ergodicity 


Expectations and Stochastic Processes 
Suppose we have observed a sample of size $T$ of some random variable $Y_t$:

    $\{y_1, y_2, \ldots, y_T\}$.    [3.1.1]

For example, consider a collection of $T$ independent and identically distributed
(i.i.d.) variables $\varepsilon_t$,

    $\{\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_T\}$,    [3.1.2]
with
    $\varepsilon_t \sim N(0, \sigma^2)$.

This is referred to as a sample of size $T$ from a Gaussian white noise process.

The observed sample [3.1.1] represents $T$ particular numbers, but this set of
$T$ numbers is only one possible outcome of the underlying stochastic process that
generated the data. Indeed, even if we were to imagine having observed the process
for an infinite period of time, arriving at the sequence

    $\{y_t\}_{t=-\infty}^{\infty} = \{\ldots, y_{-1}, y_0, y_1, y_2, \ldots, y_T, y_{T+1}, y_{T+2}, \ldots\}$,

the infinite sequence $\{y_t\}_{t=-\infty}^{\infty}$ would still be viewed as a single realization from a
time series process. For example, we might set one computer to work generating
an infinite sequence of i.i.d. $N(0, \sigma^2)$ variates, $\{\varepsilon_t^{(1)}\}_{t=-\infty}^{\infty}$, and a second computer
generating a separate sequence, $\{\varepsilon_t^{(2)}\}_{t=-\infty}^{\infty}$. We would then view these as two
independent realizations of a Gaussian white noise process.




Imagine a battery of $I$ such computers generating sequences $\{y_t^{(1)}\}_{t=-\infty}^{\infty}$,
$\{y_t^{(2)}\}_{t=-\infty}^{\infty}, \ldots, \{y_t^{(I)}\}_{t=-\infty}^{\infty}$, and consider selecting the observation associated with
date $t$ from each sequence:

    $\{y_t^{(1)}, y_t^{(2)}, \ldots, y_t^{(I)}\}$.

This would be described as a sample of $I$ realizations of the random variable $Y_t$.
This random variable has some density, denoted $f_{Y_t}(y_t)$, which is called the
unconditional density of $Y_t$. For example, for the Gaussian white noise process, this
density is given by

    $f_{Y_t}(y_t) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ \frac{-y_t^2}{2\sigma^2} \right]$.

The expectation of the $t$th observation of a time series refers to the mean of
this probability distribution, provided it exists:

    $E(Y_t) \equiv \int_{-\infty}^{\infty} y_t f_{Y_t}(y_t) \, dy_t$.    [3.1.3]

We might view this as the probability limit of the ensemble average:

    $E(Y_t) = \operatorname{plim}_{I \to \infty} (1/I) \sum_{i=1}^{I} Y_t^{(i)}$.    [3.1.4]

For example, if $\{Y_t\}_{t=-\infty}^{\infty}$ represents the sum of a constant $\mu$ plus a Gaussian white
noise process $\{\varepsilon_t\}_{t=-\infty}^{\infty}$,

    $Y_t = \mu + \varepsilon_t$,    [3.1.5]

then its mean is

    $E(Y_t) = \mu + E(\varepsilon_t) = \mu$.    [3.1.6]

If $Y_t$ is a time trend plus Gaussian white noise,

    $Y_t = \beta t + \varepsilon_t$,    [3.1.7]

then its mean is

    $E(Y_t) = \beta t$.    [3.1.8]


Sometimes for emphasis the expectation $E(Y_t)$ is called the unconditional
mean of $Y_t$. The unconditional mean is denoted $\mu_t$:

    $E(Y_t) = \mu_t$.

Note that this notation allows the general possibility that the mean can be a function
of the date of the observation $t$. For the process [3.1.7] involving the time trend,
the mean [3.1.8] is a function of time, whereas for the constant plus Gaussian white
noise, the mean [3.1.6] is not a function of time.
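The ensemble-average view of the unconditional mean in [3.1.4] can be illustrated with a small simulation of the two example processes [3.1.5] and [3.1.7]. This is a sketch, not a procedure from the text: the parameter values, the number of realizations, the fixed seed, and the function names are illustrative assumptions. Because the noise is i.i.d., only the date-$t$ observation of each realization is drawn.

```python
import random

mu, beta, sigma = 2.0, 0.5, 1.0   # illustrative parameter values


def constant_plus_noise(rng, t):
    """One draw of Y_t = mu + eps_t, process [3.1.5]."""
    return mu + rng.gauss(0.0, sigma)


def trend_plus_noise(rng, t):
    """One draw of Y_t = beta * t + eps_t, process [3.1.7]."""
    return beta * t + rng.gauss(0.0, sigma)


def ensemble_mean(draw, t, n_realizations, seed=0):
    """Average the date-t observation across independent realizations, as in [3.1.4]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_realizations):
        total += draw(rng, t)
    return total / n_realizations


n = 50_000
const_means = {t: ensemble_mean(constant_plus_noise, t, n) for t in (1, 10)}
trend_means = {t: ensemble_mean(trend_plus_noise, t, n) for t in (1, 10)}
print(const_means)
print(trend_means)
```

The ensemble averages for the first process are close to $\mu$ at both dates, while those for the second track $\beta t$, which is the sense in which $\mu_t$ may depend on the date.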

The variance of the random variable $Y_t$ (denoted $\gamma_{0t}$) is similarly defined as

    $\gamma_{0t} \equiv E(Y_t - \mu_t)^2 = \int_{-\infty}^{\infty} (y_t - \mu_t)^2 f_{Y_t}(y_t) \, dy_t$.    [3.1.9]

For example, for the process [3.1.7], the variance is

    $\gamma_{0t} = E(Y_t - \beta t)^2 = E(\varepsilon_t^2) = \sigma^2$.

Autocovariance 


Given a particular realization such as $\{y_t^{(1)}\}_{t=-\infty}^{\infty}$ on a time series process,
consider constructing a vector $x_t^{(1)}$ associated with date $t$. This vector consists of
the $[j + 1]$ most recent observations on $y$ as of date $t$ for that realization:

    $x_t^{(1)} \equiv \begin{bmatrix} y_t^{(1)} \\ y_{t-1}^{(1)} \\ \vdots \\ y_{t-j}^{(1)} \end{bmatrix}$.

We think of each realization $\{y_t^{(i)}\}_{t=-\infty}^{\infty}$ as generating one particular value of the
vector $x_t$ and want to calculate the probability distribution of this vector $x_t^{(i)}$ across
realizations $i$. This distribution is called the joint distribution of $(Y_t, Y_{t-1}, \ldots,
Y_{t-j})$. From this distribution we can calculate the $j$th autocovariance of $Y_t$ (denoted
$\gamma_{jt}$):

    $\gamma_{jt} = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} (y_t - \mu_t)(y_{t-j} - \mu_{t-j})$
    $\qquad \times f_{Y_t, Y_{t-1}, \ldots, Y_{t-j}}(y_t, y_{t-1}, \ldots, y_{t-j}) \, dy_t \, dy_{t-1} \cdots dy_{t-j}$    [3.1.10]
    $\quad = E(Y_t - \mu_t)(Y_{t-j} - \mu_{t-j})$.

Note that [3.1.10] has the form of a covariance between two variables $X$ and $Y$:

    $\operatorname{Cov}(X, Y) = E(X - \mu_X)(Y - \mu_Y)$.

Thus [3.1.10] could be described as the covariance of $Y_t$ with its own lagged value;
hence, the term “autocovariance.” Notice further from [3.1.10] that the 0th
autocovariance is just the variance of $Y_t$, as anticipated by the notation $\gamma_{0t}$ in [3.1.9].

The autocovariance $\gamma_{jt}$ can be viewed as the $(1, j + 1)$ element of the
variance-covariance matrix of the vector $x_t$. For this reason, the autocovariances are
described as the second moments of the process for $Y_t$.

Again it may be helpful to think of the $j$th autocovariance as the probability
limit of an ensemble average:

    $\gamma_{jt} = \operatorname{plim}_{I \to \infty} (1/I) \sum_{i=1}^{I} [Y_t^{(i)} - \mu_t] \cdot [Y_{t-j}^{(i)} - \mu_{t-j}]$.    [3.1.11]

As an example of calculating autocovariances, note that for the process in
[3.1.5] the autocovariances are all zero for $j \neq 0$:

    $\gamma_{jt} = E(Y_t - \mu)(Y_{t-j} - \mu) = E(\varepsilon_t \varepsilon_{t-j}) = 0$ for $j \neq 0$.


Stationarity 


If neither the mean $\mu_t$ nor the autocovariances $\gamma_{jt}$ depend on the date $t$, then
the process for $Y_t$ is said to be covariance-stationary or weakly stationary:

    $E(Y_t) = \mu$ for all $t$
    $E(Y_t - \mu)(Y_{t-j} - \mu) = \gamma_j$ for all $t$ and any $j$.

For example, the process in [3.1.5] is covariance-stationary:

    $E(Y_t) = \mu$
    $E(Y_t - \mu)(Y_{t-j} - \mu) = \begin{cases} \sigma^2 & \text{for } j = 0 \\ 0 & \text{for } j \neq 0. \end{cases}$

By contrast, the process of [3.1.7] is not covariance-stationary, because its mean,
$\beta t$, is a function of time.

Notice that if a process is covariance-stationary, the covariance between $Y_t$
and $Y_{t-j}$ depends only on $j$, the length of time separating the observations, and
not on $t$, the date of the observation. It follows that for a covariance-stationary
process, $\gamma_j$ and $\gamma_{-j}$ would represent the same magnitude. To see this, recall the
definition

    $\gamma_j = E(Y_t - \mu)(Y_{t-j} - \mu)$.    [3.1.12]

If the process is covariance-stationary, then this magnitude is the same for any
value of $t$ we might have chosen; for example, we can replace $t$ with $t + j$:

    $\gamma_j = E(Y_{t+j} - \mu)(Y_{[t+j]-j} - \mu) = E(Y_{t+j} - \mu)(Y_t - \mu) = E(Y_t - \mu)(Y_{t+j} - \mu)$.

But referring again to the definition [3.1.12], this last expression is just the definition
of $\gamma_{-j}$. Thus, for any covariance-stationary process,

    $\gamma_j = \gamma_{-j}$ for all integers $j$.    [3.1.13]


A different concept is that of strict stationarity. A process is said to be strictly
stationary if, for any values of $j_1, j_2, \ldots, j_n$, the joint distribution of
$(Y_t, Y_{t+j_1}, Y_{t+j_2}, \ldots, Y_{t+j_n})$ depends only on the intervals separating the dates
$(j_1, j_2, \ldots, j_n)$ and not on the date itself ($t$). Notice that if a process is strictly
stationary with finite second moments, then it must be covariance-stationary: if the
densities over which we are integrating in [3.1.3] and [3.1.10] do not depend on time,
then the moments $\mu_t$ and $\gamma_{jt}$ will not depend on time. However, it is possible to
imagine a process that is covariance-stationary but not strictly stationary; the mean
and autocovariances might not be functions of time, while higher moments such as
$E(Y_t^3)$ are.

In this text the term "stationary" by itself is taken to mean "covariance-
stationary."

A process $\{Y_t\}$ is said to be Gaussian if the joint density

$$f_{Y_t, Y_{t+j_1}, \ldots, Y_{t+j_n}}(y_t, y_{t+j_1}, \ldots, y_{t+j_n})$$

is Gaussian for any $j_1, j_2, \ldots, j_n$. Since the mean and variance are all that are
needed to parameterize a multivariate Gaussian distribution completely, a covariance-
stationary Gaussian process is strictly stationary.


Ergodicity 
We have viewed expectations of a time series in terms of ensemble averages
such as [3.1.4] and [3.1.11]. These definitions may seem a bit contrived, since
usually all one has available is a single realization of size $T$ from the process, which
we earlier denoted $\{y_1^{(1)}, y_2^{(1)}, \ldots, y_T^{(1)}\}$. From these observations we would
calculate the sample mean $\bar{y}$. This, of course, is not an ensemble average but rather
a time average:

$$\bar{y} = (1/T) \sum_{t=1}^{T} y_t^{(1)}.$$   [3.1.14]




Whether time averages such as [3.1.14] eventually converge to the ensemble concept
$E(Y_t)$ for a stationary process has to do with ergodicity. A covariance-stationary
process is said to be ergodic for the mean if [3.1.14] converges in probability to
$E(Y_t)$ as $T \to \infty$.¹ A process will be ergodic for the mean provided that the
autocovariance $\gamma_j$ goes to zero sufficiently quickly as $j$ becomes large. In Chapter 7 we
will see that if the autocovariances for a covariance-stationary process satisfy

$$\sum_{j=0}^{\infty} |\gamma_j| < \infty,$$   [3.1.15]

then $\{Y_t\}$ is ergodic for the mean.
Similarly, a covariance-stationary process is said to be ergodic for second
moments if

$$[1/(T - j)] \sum_{t=j+1}^{T} (Y_t - \mu)(Y_{t-j} - \mu) \xrightarrow{p} \gamma_j$$

for all $j$. Sufficient conditions for second-moment ergodicity will be presented in
Chapter 7. In the special case where $\{Y_t\}$ is a stationary Gaussian process, condition
[3.1.15] is sufficient to ensure ergodicity for all moments.

For many applications, stationarity and ergodicity turn out to amount to the 
same requirements. For purposes of clarifying the concepts of stationarity and 
ergodicity, however, it may be helpful to consider an example of a process that is 
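stationary but not ergodic. Suppose the mean $\mu^{(i)}$ for the $i$th realization
$\{y_t^{(i)}\}_{t=-\infty}^{\infty}$ is generated from a $N(0, \lambda^2)$ distribution, say

$$Y_t^{(i)} = \mu^{(i)} + \varepsilon_t.$$   [3.1.16]

Here $\{\varepsilon_t\}$ is a Gaussian white noise process with mean zero and variance $\sigma^2$ that
is independent of $\mu^{(i)}$. Notice that

$$\mu_t = E(\mu^{(i)}) + E(\varepsilon_t) = 0.$$

Also,

$$\gamma_{0t} = E(\mu^{(i)} + \varepsilon_t)^2 = \lambda^2 + \sigma^2$$

and

$$\gamma_{jt} = E(\mu^{(i)} + \varepsilon_t)(\mu^{(i)} + \varepsilon_{t-j}) = \lambda^2 \quad \text{for } j \neq 0.$$

Thus the process of [3.1.16] is covariance-stationary. It does not satisfy the sufficient
condition [3.1.15] for ergodicity for the mean, however, and indeed, the time average

$$(1/T) \sum_{t=1}^{T} Y_t^{(i)} = (1/T) \sum_{t=1}^{T} (\mu^{(i)} + \varepsilon_t) = \mu^{(i)} + (1/T) \sum_{t=1}^{T} \varepsilon_t$$

converges to $\mu^{(i)}$ rather than to zero, the mean of $Y_t$.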
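
The failure of ergodicity in [3.1.16] can be seen in a small simulation. The following is only a sketch: the values of lambda, sigma, T, the seed, and the function name `time_average` are illustrative assumptions, not taken from the text. Each realization draws its own mean once and then averages over time.

```python
import random

def time_average(mu_i, sigma, T, rng):
    """Time average of one realization Y_t = mu_i + eps_t, t = 1..T."""
    return sum(mu_i + rng.gauss(0.0, sigma) for _ in range(T)) / T

rng = random.Random(0)            # arbitrary fixed seed (assumption)
lam, sigma, T = 2.0, 1.0, 20000   # illustrative values, not from the text

# Draw mu^(i) once per realization, then average that realization over time.
draws = [rng.gauss(0.0, lam) for _ in range(5)]
averages = [time_average(mu_i, sigma, T, rng) for mu_i in draws]

for mu_i, ybar in zip(draws, averages):
    print(f"mu_i = {mu_i:+.3f}   time average = {ybar:+.3f}")
```

Each time average should settle near that realization's own draw of the mean rather than near zero, which is the ensemble mean.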


3.2. White Noise 


The basic building block for all the processes considered in this chapter is a sequence
$\{\varepsilon_t\}_{t=-\infty}^{\infty}$ whose elements have mean zero and variance $\sigma^2$,

$$E(\varepsilon_t) = 0$$   [3.2.1]
$$E(\varepsilon_t^2) = \sigma^2,$$   [3.2.2]

and for which the $\varepsilon$'s are uncorrelated across time:

$$E(\varepsilon_t \varepsilon_\tau) = 0 \quad \text{for } t \neq \tau.$$   [3.2.3]

¹Often "ergodicity" is used in a more general sense; see Anderson and Moore (1979, p. 319) or
Hannan (1970, pp. 201-20).


A process satisfying [3.2.1] through [3.2.3] is described as a white noise process.
We shall on occasion wish to replace [3.2.3] with the slightly stronger condition
that the $\varepsilon$'s are independent across time:

$$\varepsilon_t, \varepsilon_\tau \text{ independent for } t \neq \tau.$$   [3.2.4]

Notice that [3.2.4] implies [3.2.3] but [3.2.3] does not imply [3.2.4]. A process
satisfying [3.2.1] through [3.2.4] is called an independent white noise process.
Finally, if [3.2.1] through [3.2.4] hold along with

$$\varepsilon_t \sim N(0, \sigma^2),$$   [3.2.5]

then we have the Gaussian white noise process.
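
To illustrate why [3.2.3] does not imply [3.2.4], here is a sketch of one process that is white noise without being independent white noise. The construction (each term is the product of two neighboring independent standard normal draws) is an assumed example, not one given in the text. Its levels are uncorrelated across time, but its squares are correlated; in population the first autocorrelation of the squares is 0.25.

```python
import random

rng = random.Random(1)   # arbitrary fixed seed (assumption)
T = 100_000
z = [rng.gauss(0.0, 1.0) for _ in range(T + 1)]
eps = [z[t] * z[t - 1] for t in range(1, T + 1)]   # eps_t = z_t * z_{t-1}

def sample_corr(x, y):
    """Plain sample correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    return cov / (vx * vy) ** 0.5

corr_levels = sample_corr(eps[1:], eps[:-1])    # should be near zero
squares = [e * e for e in eps]
corr_squares = sample_corr(squares[1:], squares[:-1])   # clearly positive

print("lag-1 correlation of eps_t:  ", round(corr_levels, 3))
print("lag-1 correlation of eps_t^2:", round(corr_squares, 3))
```

Because adjacent terms share one underlying draw, knowing the size of one term is informative about the size of the next, so the terms are dependent even though they are uncorrelated.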


3.3. Moving Average Processes 


The First-Order Moving Average Process 
Let $\{\varepsilon_t\}$ be white noise as in [3.2.1] through [3.2.3], and consider the process

$$Y_t = \mu + \varepsilon_t + \theta \varepsilon_{t-1},$$   [3.3.1]

where $\mu$ and $\theta$ could be any constants. This time series is called a first-order moving
average process, denoted MA(1). The term "moving average" comes from the fact
that $Y_t$ is constructed from a weighted sum, akin to an average, of the two most
recent values of $\varepsilon$.
The expectation of $Y_t$ is given by

$$E(Y_t) = E(\mu + \varepsilon_t + \theta \varepsilon_{t-1}) = \mu + E(\varepsilon_t) + \theta \cdot E(\varepsilon_{t-1}) = \mu.$$   [3.3.2]

We used the symbol $\mu$ for the constant term in [3.3.1] in anticipation of the result
that this constant term turns out to be the mean of the process.
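The variance of $Y_t$ is

$$E(Y_t - \mu)^2 = E(\varepsilon_t + \theta \varepsilon_{t-1})^2$$
$$\quad = E(\varepsilon_t^2 + 2\theta \varepsilon_t \varepsilon_{t-1} + \theta^2 \varepsilon_{t-1}^2)$$   [3.3.3]
$$\quad = \sigma^2 + 0 + \theta^2 \sigma^2$$
$$\quad = (1 + \theta^2)\sigma^2.$$

The first autocovariance is

$$E(Y_t - \mu)(Y_{t-1} - \mu) = E(\varepsilon_t + \theta \varepsilon_{t-1})(\varepsilon_{t-1} + \theta \varepsilon_{t-2})$$
$$\quad = E(\varepsilon_t \varepsilon_{t-1} + \theta \varepsilon_{t-1}^2 + \theta \varepsilon_t \varepsilon_{t-2} + \theta^2 \varepsilon_{t-1} \varepsilon_{t-2})$$   [3.3.4]
$$\quad = 0 + \theta\sigma^2 + 0 + 0.$$

Higher autocovariances are all zero:

$$E(Y_t - \mu)(Y_{t-j} - \mu) = E(\varepsilon_t + \theta \varepsilon_{t-1})(\varepsilon_{t-j} + \theta \varepsilon_{t-j-1}) = 0 \quad \text{for } j > 1.$$   [3.3.5]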


Since the mean and autocovariances are not functions of time, an MA(1) process
is covariance-stationary regardless of the value of $\theta$. Furthermore, [3.1.15] is clearly
satisfied:

$$\sum_{j=0}^{\infty} |\gamma_j| = (1 + \theta^2)\sigma^2 + |\theta \sigma^2|.$$

Thus, if $\{\varepsilon_t\}$ is Gaussian white noise, then the MA(1) process [3.3.1] is ergodic for
all moments.




The $j$th autocorrelation of a covariance-stationary process (denoted $\rho_j$) is
defined as its $j$th autocovariance divided by the variance:

$$\rho_j \equiv \gamma_j / \gamma_0.$$   [3.3.6]

Again the terminology arises from the fact that $\rho_j$ is the correlation between $Y_t$
and $Y_{t-j}$:

$$\operatorname{Corr}(Y_t, Y_{t-j}) = \frac{\operatorname{Cov}(Y_t, Y_{t-j})}{\sqrt{\operatorname{Var}(Y_t)} \sqrt{\operatorname{Var}(Y_{t-j})}} = \frac{\gamma_j}{\sqrt{\gamma_0} \sqrt{\gamma_0}} = \rho_j.$$

Since $\rho_j$ is a correlation, $|\rho_j| \leq 1$ for all $j$, by the Cauchy-Schwarz inequality. Notice
also that the 0th autocorrelation $\rho_0$ is equal to unity for any covariance-stationary
process by definition.

From [3.3.3] and [3.3.4], the first autocorrelation for an MA(1) process is
given by

$$\rho_1 = \frac{\theta \sigma^2}{(1 + \theta^2)\sigma^2} = \frac{\theta}{(1 + \theta^2)}.$$   [3.3.7]

Higher autocorrelations are all zero.

The autocorrelation $\rho_j$ can be plotted as a function of $j$ as in Figure 3.1. Panel
(a) shows the autocorrelation function for white noise, while panel (b) gives the
autocorrelation function for the MA(1) process:

$$Y_t = \varepsilon_t + 0.8 \varepsilon_{t-1}.$$


For different specifications of $\theta$ we would obtain different values for the first
autocorrelation $\rho_1$ in [3.3.7]. Positive values of $\theta$ induce positive autocorrelation
in the series. In this case, an unusually large value of $Y_t$ is likely to be followed
by a larger-than-average value for $Y_{t+1}$, just as a smaller-than-average $Y_t$ may well
be followed by a smaller-than-average $Y_{t+1}$. By contrast, negative values of $\theta$ imply
negative autocorrelation: a large $Y_t$ might be expected to be followed by a small
value for $Y_{t+1}$.

The values for $\rho_1$ implied by different specifications of $\theta$ are plotted in Figure
3.2. Notice that the largest possible value for $\rho_1$ is 0.5; this occurs if $\theta = 1$. The
smallest value for $\rho_1$ is $-0.5$, which occurs if $\theta = -1$. For any value of $\rho_1$ between
$-0.5$ and 0.5, there are two different values of $\theta$ that could produce that
autocorrelation. This is because the value of $\theta/(1 + \theta^2)$ is unchanged if $\theta$ is replaced
by $1/\theta$:

$$\rho_1 = \frac{(1/\theta)}{1 + (1/\theta)^2} = \frac{\theta^2 \cdot (1/\theta)}{\theta^2 [1 + (1/\theta)^2]} = \frac{\theta}{\theta^2 + 1}.$$

For example, the processes

$$Y_t = \varepsilon_t + 0.5 \varepsilon_{t-1}$$

and

$$Y_t = \varepsilon_t + 2 \varepsilon_{t-1}$$

would have the same autocorrelation function:

$$\rho_1 = \frac{2}{(1 + 2^2)} = \frac{0.5}{(1 + 0.5^2)} = 0.4.$$

We will have more to say about the relation between two MA(1) processes that
share the same autocorrelation function in Section 3.7.
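
The invariance of the first autocorrelation under replacing the moving average coefficient by its reciprocal is easy to check numerically. This is a minimal sketch of formula [3.3.7]; the function name `ma1_rho1` and the trial values are assumptions chosen for illustration.

```python
def ma1_rho1(theta):
    """First autocorrelation of an MA(1) process, as in [3.3.7]."""
    return theta / (1.0 + theta ** 2)

# Each theta is paired with its reciprocal; both columns should agree.
for theta in (0.5, 2.0, -0.25, -4.0, 1.0):
    print(f"theta = {theta:+.2f}  rho_1 = {ma1_rho1(theta):+.4f}  "
          f"rho_1 at 1/theta = {ma1_rho1(1.0 / theta):+.4f}")
```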




[Figure 3.1: five panels, each plotting the autocorrelation against Lag (j).
(a) White noise: $Y_t = \varepsilon_t$
(b) MA(1): $Y_t = \varepsilon_t + 0.8\varepsilon_{t-1}$
(c) MA(4): $Y_t = \varepsilon_t - 0.6\varepsilon_{t-1} + 0.3\varepsilon_{t-2} - 0.5\varepsilon_{t-3} + 0.5\varepsilon_{t-4}$
(d) AR(1): $Y_t = 0.8Y_{t-1} + \varepsilon_t$
(e) AR(1): $Y_t = -0.8Y_{t-1} + \varepsilon_t$]

FIGURE 3.1 Autocorrelation functions for assorted ARMA processes.


The qth-Order Moving Average Process 
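A $q$th-order moving average process, denoted MA($q$), is characterized by

$$Y_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \cdots + \theta_q \varepsilon_{t-q},$$   [3.3.8]

where $\{\varepsilon_t\}$ satisfies [3.2.1] through [3.2.3] and $(\theta_1, \theta_2, \ldots, \theta_q)$ could be any real
numbers. The mean of [3.3.8] is again given by $\mu$:

$$E(Y_t) = \mu + E(\varepsilon_t) + \theta_1 \cdot E(\varepsilon_{t-1}) + \theta_2 \cdot E(\varepsilon_{t-2}) + \cdots + \theta_q \cdot E(\varepsilon_{t-q}) = \mu.$$

The variance of an MA($q$) process is

$$\gamma_0 = E(Y_t - \mu)^2 = E(\varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \cdots + \theta_q \varepsilon_{t-q})^2.$$   [3.3.9]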




FIGURE 3.2 The first autocorrelation ($\rho_1$) for an MA(1) process possible for
different values of $\theta$.


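Since the $\varepsilon$'s are uncorrelated, the variance [3.3.9] is²

$$\gamma_0 = \sigma^2 + \theta_1^2 \sigma^2 + \theta_2^2 \sigma^2 + \cdots + \theta_q^2 \sigma^2 = (1 + \theta_1^2 + \theta_2^2 + \cdots + \theta_q^2)\sigma^2.$$   [3.3.10]

For $j = 1, 2, \ldots, q$,

$$\gamma_j = E[(\varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \cdots + \theta_q \varepsilon_{t-q})$$
$$\qquad \times (\varepsilon_{t-j} + \theta_1 \varepsilon_{t-j-1} + \theta_2 \varepsilon_{t-j-2} + \cdots + \theta_q \varepsilon_{t-j-q})]$$   [3.3.11]
$$\quad = E[\theta_j \varepsilon_{t-j}^2 + \theta_{j+1}\theta_1 \varepsilon_{t-j-1}^2 + \theta_{j+2}\theta_2 \varepsilon_{t-j-2}^2 + \cdots + \theta_q \theta_{q-j} \varepsilon_{t-q}^2].$$

Terms involving $\varepsilon$'s at different dates have been dropped because their product
has expectation zero, and $\theta_0$ is defined to be unity. For $j > q$, there are no $\varepsilon$'s with
common dates in the definition of $\gamma_j$, and so the expectation is zero. Thus,

$$\gamma_j = \begin{cases} [\theta_j + \theta_{j+1}\theta_1 + \theta_{j+2}\theta_2 + \cdots + \theta_q \theta_{q-j}] \cdot \sigma^2 & \text{for } j = 1, 2, \ldots, q \\ 0 & \text{for } j > q. \end{cases}$$   [3.3.12]

For example, for an MA(2) process,

$$\gamma_0 = [1 + \theta_1^2 + \theta_2^2] \cdot \sigma^2$$
$$\gamma_1 = [\theta_1 + \theta_2 \theta_1] \cdot \sigma^2$$
$$\gamma_2 = [\theta_2] \cdot \sigma^2$$
$$\gamma_3 = \gamma_4 = \cdots = 0.$$

For any values of $(\theta_1, \theta_2, \ldots, \theta_q)$, the MA($q$) process is thus covariance-
stationary. Condition [3.1.15] is satisfied, so for Gaussian $\varepsilon_t$ the MA($q$) process is
also ergodic for all moments. The autocorrelation function is zero after $q$ lags, as
in panel (c) of Figure 3.1.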
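
Formulas [3.3.10] and [3.3.12] translate directly into a short routine. This is a sketch only; the function name `ma_autocov` and the numerical MA(2) coefficients are assumptions chosen for illustration.

```python
def ma_autocov(thetas, sigma2, max_lag):
    """Autocovariances gamma_0..gamma_max_lag of an MA(q) process.

    thetas holds (theta_1, ..., theta_q); theta_0 is taken to be unity.
    """
    coef = [1.0] + list(thetas)
    q = len(thetas)
    gammas = []
    for j in range(max_lag + 1):
        if j > q:
            gammas.append(0.0)   # no common dates beyond lag q
        else:
            # sum over k of theta_{j+k} * theta_k, for k = 0..q-j
            gammas.append(sigma2 * sum(coef[j + k] * coef[k]
                                       for k in range(q - j + 1)))
    return gammas

# Hypothetical MA(2) with theta_1 = 0.5, theta_2 = -0.3, sigma^2 = 2
ma2_gammas = ma_autocov([0.5, -0.3], 2.0, 4)
print(ma2_gammas)
```

For the MA(2) case the first three entries can be compared by hand with the expressions for the variance and the first two autocovariances given above.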


The Infinite-Order Moving Average Process 
The MA($q$) process can be written

$$Y_t = \mu + \sum_{j=0}^{q} \theta_j \varepsilon_{t-j}$$

with $\theta_0 = 1$. Consider the process that results as $q \to \infty$:

$$Y_t = \mu + \sum_{j=0}^{\infty} \psi_j \varepsilon_{t-j} = \mu + \psi_0 \varepsilon_t + \psi_1 \varepsilon_{t-1} + \psi_2 \varepsilon_{t-2} + \cdots.$$   [3.3.13]

This could be described as an MA($\infty$) process. To preserve notational flexibility
later, we will use $\psi$'s for the coefficients of an infinite-order moving average process
and $\theta$'s for the coefficients of a finite-order moving average process.

²See equation [A.5.18] in Appendix A at the end of the book.

Appendix 3.A to this chapter shows that the infinite sequence in [3.3.13]
generates a well defined covariance-stationary process provided that

$$\sum_{j=0}^{\infty} \psi_j^2 < \infty.$$   [3.3.14]

It is often convenient to work with a slightly stronger condition than [3.3.14]:

$$\sum_{j=0}^{\infty} |\psi_j| < \infty.$$   [3.3.15]

A sequence of numbers $\{\psi_j\}_{j=0}^{\infty}$ satisfying [3.3.14] is said to be square summable,
whereas a sequence satisfying [3.3.15] is said to be absolutely summable. Absolute
summability implies square-summability, but the converse does not hold: there
are examples of square-summable sequences that are not absolutely summable
(again, see Appendix 3.A).

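The mean and autocovariances of an MA($\infty$) process with absolutely
summable coefficients can be calculated from a simple extrapolation of the results for
an MA($q$) process:³

$$E(Y_t) = \lim_{T \to \infty} E(\mu + \psi_0 \varepsilon_t + \psi_1 \varepsilon_{t-1} + \psi_2 \varepsilon_{t-2} + \cdots + \psi_T \varepsilon_{t-T}) = \mu$$   [3.3.16]

$$\gamma_0 = E(Y_t - \mu)^2$$
$$\quad = \lim_{T \to \infty} E(\psi_0 \varepsilon_t + \psi_1 \varepsilon_{t-1} + \psi_2 \varepsilon_{t-2} + \cdots + \psi_T \varepsilon_{t-T})^2$$   [3.3.17]
$$\quad = \lim_{T \to \infty} (\psi_0^2 + \psi_1^2 + \psi_2^2 + \cdots + \psi_T^2) \cdot \sigma^2$$

$$\gamma_j = E(Y_t - \mu)(Y_{t-j} - \mu)$$
$$\quad = \sigma^2 (\psi_j \psi_0 + \psi_{j+1}\psi_1 + \psi_{j+2}\psi_2 + \psi_{j+3}\psi_3 + \cdots).$$   [3.3.18]

Moreover, an MA($\infty$) process with absolutely summable coefficients has absolutely
summable autocovariances:

$$\sum_{j=0}^{\infty} |\gamma_j| < \infty.$$   [3.3.19]

Hence, an MA($\infty$) process satisfying [3.3.15] is ergodic for the mean (see Appendix
3.A). If the $\varepsilon$'s are Gaussian, then the process is ergodic for all moments.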
³Absolute summability of $\{\psi_j\}_{j=0}^{\infty}$ and existence of the second moment $E(\varepsilon_t^2)$ are
sufficient conditions to permit interchanging the order of integration and summation.
Specifically, if $\{X_T\}_{T=1}^{\infty}$ is a sequence of random variables such that

$$\sum_{T=1}^{\infty} E|X_T| < \infty,$$

then

$$E\left\{\sum_{T=1}^{\infty} X_T\right\} = \sum_{T=1}^{\infty} E(X_T).$$

See Rao (1973, p. 111).




3.4. Autoregressive Processes 


The First-Order Autoregressive Process 


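A first-order autoregression, denoted AR(1), satisfies the following difference
equation:

$$Y_t = c + \phi Y_{t-1} + \varepsilon_t.$$   [3.4.1]

Again, $\{\varepsilon_t\}$ is a white noise sequence satisfying [3.2.1] through [3.2.3]. Notice that
[3.4.1] takes the form of the first-order difference equation [1.1.1] or [2.2.1] in
which the input variable $w_t$ is given by $w_t = c + \varepsilon_t$. We know from the analysis
of first-order difference equations that if $|\phi| \geq 1$, the consequences of the $\varepsilon$'s for
$Y$ accumulate rather than die out over time. It is thus perhaps not surprising that
when $|\phi| \geq 1$, there does not exist a covariance-stationary process for $Y_t$ with finite
variance that satisfies [3.4.1]. In the case when $|\phi| < 1$, there is a covariance-
stationary process for $Y_t$ satisfying [3.4.1]. It is given by the stable solution to [3.4.1]
characterized in [2.2.9]:

$$Y_t = (c + \varepsilon_t) + \phi \cdot (c + \varepsilon_{t-1}) + \phi^2 \cdot (c + \varepsilon_{t-2}) + \phi^3 \cdot (c + \varepsilon_{t-3}) + \cdots$$
$$\quad = [c/(1 - \phi)] + \varepsilon_t + \phi \varepsilon_{t-1} + \phi^2 \varepsilon_{t-2} + \phi^3 \varepsilon_{t-3} + \cdots.$$   [3.4.2]

This can be viewed as an MA($\infty$) process as in [3.3.13] with $\psi_j$ given by $\phi^j$. When
$|\phi| < 1$, condition [3.3.15] is satisfied:

$$\sum_{j=0}^{\infty} |\psi_j| = \sum_{j=0}^{\infty} |\phi|^j,$$

which equals $1/(1 - |\phi|)$ provided that $|\phi| < 1$. The remainder of this discussion
of first-order autoregressive processes assumes that $|\phi| < 1$. This ensures that the
MA($\infty$) representation exists and can be manipulated in the obvious way, and that
the AR(1) process is ergodic for the mean.

Taking expectations of [3.4.2], we see that

$$E(Y_t) = [c/(1 - \phi)] + 0 + 0 + \cdots,$$

so that the mean of a stationary AR(1) process is

$$\mu = c/(1 - \phi).$$   [3.4.3]

The variance is

$$\gamma_0 = E(Y_t - \mu)^2$$
$$\quad = E(\varepsilon_t + \phi \varepsilon_{t-1} + \phi^2 \varepsilon_{t-2} + \phi^3 \varepsilon_{t-3} + \cdots)^2$$   [3.4.4]
$$\quad = (1 + \phi^2 + \phi^4 + \phi^6 + \cdots) \cdot \sigma^2$$
$$\quad = \sigma^2/(1 - \phi^2),$$

while the $j$th autocovariance is

$$\gamma_j = E(Y_t - \mu)(Y_{t-j} - \mu)$$
$$\quad = E[\varepsilon_t + \phi \varepsilon_{t-1} + \phi^2 \varepsilon_{t-2} + \cdots + \phi^j \varepsilon_{t-j} + \phi^{j+1} \varepsilon_{t-j-1} + \phi^{j+2} \varepsilon_{t-j-2} + \cdots]$$
$$\qquad \times [\varepsilon_{t-j} + \phi \varepsilon_{t-j-1} + \phi^2 \varepsilon_{t-j-2} + \cdots]$$   [3.4.5]
$$\quad = [\phi^j + \phi^{j+2} + \phi^{j+4} + \cdots] \cdot \sigma^2$$
$$\quad = \phi^j [1 + \phi^2 + \phi^4 + \cdots] \cdot \sigma^2$$
$$\quad = [\phi^j/(1 - \phi^2)] \cdot \sigma^2.$$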
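
As a numerical check on [3.4.4] and [3.4.5], the infinite sum in [3.3.18] can be truncated and compared with the closed form. The sketch below assumes illustrative values for the autoregressive parameter and the innovation variance; the function names and the truncation length are likewise assumptions.

```python
def ar1_autocov_truncated(phi, sigma2, j, n_terms=500):
    """gamma_j from [3.3.18] with psi_k = phi**k, truncated after n_terms terms."""
    psi = [phi ** k for k in range(n_terms + j)]
    return sigma2 * sum(psi[j + k] * psi[k] for k in range(n_terms))

def ar1_autocov_closed(phi, sigma2, j):
    """Closed form from [3.4.5]: phi**j * sigma2 / (1 - phi**2)."""
    return sigma2 * phi ** j / (1.0 - phi ** 2)

phi, sigma2 = 0.8, 1.5   # illustrative values with |phi| < 1
gaps = []
for j in range(4):
    truncated = ar1_autocov_truncated(phi, sigma2, j)
    closed = ar1_autocov_closed(phi, sigma2, j)
    gaps.append(abs(truncated - closed))
    print(j, truncated, closed)
```

The agreement relies on the geometric decay of the weights; with a parameter at or above one in absolute value the truncated sums would grow without bound instead of settling down.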




It follows from [3.4.4] and [3.4.5] that the autocorrelation function,

$$\rho_j = \gamma_j/\gamma_0 = \phi^j,$$   [3.4.6]

follows a pattern of geometric decay as in panel (d) of Figure 3.1. Indeed, the
autocorrelation function [3.4.6] for a stationary AR(1) process is identical to the
dynamic multiplier or impulse-response function [1.1.10]; the effect of a one-unit
increase in $\varepsilon_t$ on $Y_{t+j}$ is equal to the correlation between $Y_t$ and $Y_{t+j}$. A positive
value of $\phi$, like a positive value of $\theta$ for an MA(1) process, implies positive
correlation between $Y_t$ and $Y_{t+1}$. A negative value of $\phi$ implies negative first-order
but positive second-order autocorrelation, as in panel (e) of Figure 3.1.

Figure 3.3 shows the effect on the appearance of the time series $\{y_t\}$ of varying
the parameter $\phi$. The panels show realizations of the process in [3.4.1] with $c = 0$
and $\varepsilon_t \sim N(0, 1)$ for different values of the autoregressive parameter $\phi$. Panel (a)
displays white noise ($\phi = 0$). A series with no autocorrelation looks choppy and
patternless to the eye; the value of one observation gives no information about the
value of the next observation. For $\phi = 0.5$ (panel (b)), the series seems smoother,
with observations above or below the mean often appearing in clusters of modest
duration. For $\phi = 0.9$ (panel (c)), departures from the mean can be quite
prolonged; strong shocks take considerable time to die out.
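
Realizations like those described for Figure 3.3 can be generated with a few lines of code. This sketch simulates [3.4.1] with a zero constant and standard normal innovations for the three parameter values discussed, and reports the sample first autocorrelation of each path. The seed, sample length, starting value of zero, and function names are assumptions.

```python
import random

def simulate_ar1(phi, T, rng, c=0.0):
    """Simulate Y_t = c + phi * Y_{t-1} + eps_t with eps_t ~ N(0, 1), Y_0 = 0."""
    y, path = 0.0, []
    for _ in range(T):
        y = c + phi * y + rng.gauss(0.0, 1.0)
        path.append(y)
    return path

def sample_autocorr(x, j):
    """Sample j-th autocorrelation of the list x."""
    n = len(x)
    m = sum(x) / n
    c0 = sum((v - m) ** 2 for v in x) / n
    cj = sum((x[t] - m) * (x[t - j] - m) for t in range(j, n)) / n
    return cj / c0

rng = random.Random(42)   # arbitrary fixed seed (assumption)
results = {}
for phi in (0.0, 0.5, 0.9):
    path = simulate_ar1(phi, 20000, rng)
    results[phi] = sample_autocorr(path, 1)
    print(f"phi = {phi:.1f}  sample rho_1 = {results[phi]:.3f}")
```

In a long sample the reported first autocorrelation should lie close to the autoregressive parameter used to generate the path, in line with [3.4.6].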

The moments for a stationary AR(1) were derived above by viewing it as an
MA($\infty$) process. A second way to arrive at the same results is to assume that the
process is covariance-stationary and calculate the moments directly from the
difference equation [3.4.1]. Taking expectations of both sides of [3.4.1],

$$E(Y_t) = c + \phi \cdot E(Y_{t-1}) + E(\varepsilon_t).$$   [3.4.7]

Assuming that the process is covariance-stationary,

$$E(Y_t) = E(Y_{t-1}) = \mu.$$   [3.4.8]

Substituting [3.4.8] into [3.4.7],

$$\mu = c + \phi\mu + 0$$

or

$$\mu = c/(1 - \phi),$$   [3.4.9]

reproducing the earlier result [3.4.3].

Notice that formula [3.4.9] is clearly not generating a sensible statement if
$|\phi| \geq 1$. For example, if $c > 0$ and $\phi > 1$, then $Y_t$ in [3.4.1] is equal to a positive
constant plus a positive number times its lagged value plus a mean-zero random
variable. Yet [3.4.9] seems to assert that $Y_t$ would be negative on average for such
a process! The reason that formula [3.4.9] is not valid when $|\phi| \geq 1$ is that we
assumed in [3.4.8] that $Y_t$ is covariance-stationary, an assumption which is not
correct when $|\phi| \geq 1$.

To find the second moments of $Y_t$ in an analogous manner, use [3.4.3] to
rewrite [3.4.1] as

$$Y_t = \mu(1 - \phi) + \phi Y_{t-1} + \varepsilon_t$$

or

$$(Y_t - \mu) = \phi(Y_{t-1} - \mu) + \varepsilon_t.$$   [3.4.10]

Now square both sides of [3.4.10] and take expectations:

$$E(Y_t - \mu)^2 = \phi^2 E(Y_{t-1} - \mu)^2 + 2\phi E[(Y_{t-1} - \mu)\varepsilon_t] + E(\varepsilon_t^2).$$   [3.4.11]



[Figure 3.3: three time-series panels.
(a) $\phi = 0$ (white noise)
(b) $\phi = 0.5$
(c) $\phi = 0.9$]

FIGURE 3.3 Realizations of an AR(1) process, $Y_t = \phi Y_{t-1} + \varepsilon_t$, for alternative
values of $\phi$.




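Recall from [3.4.2] that $(Y_{t-1} - \mu)$ is a linear function of $\varepsilon_{t-1}, \varepsilon_{t-2}, \ldots$:

$$(Y_{t-1} - \mu) = \varepsilon_{t-1} + \phi \varepsilon_{t-2} + \phi^2 \varepsilon_{t-3} + \cdots.$$

But $\varepsilon_t$ is uncorrelated with $\varepsilon_{t-1}, \varepsilon_{t-2}, \ldots$, so $\varepsilon_t$ must be uncorrelated with
$(Y_{t-1} - \mu)$. Thus the middle term on the right side of [3.4.11] is zero:

$$E[(Y_{t-1} - \mu)\varepsilon_t] = 0.$$   [3.4.12]

Again, assuming covariance-stationarity, we have

$$E(Y_t - \mu)^2 = E(Y_{t-1} - \mu)^2 = \gamma_0.$$   [3.4.13]

Substituting [3.4.13] and [3.4.12] into [3.4.11],

$$\gamma_0 = \phi^2 \gamma_0 + 0 + \sigma^2$$

or

$$\gamma_0 = \sigma^2/(1 - \phi^2),$$

reproducing [3.4.4].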
Similarly, we could multiply [3.4.10] by $(Y_{t-j} - \mu)$ and take expectations:

$$E[(Y_t - \mu)(Y_{t-j} - \mu)] = \phi \cdot E[(Y_{t-1} - \mu)(Y_{t-j} - \mu)] + E[\varepsilon_t (Y_{t-j} - \mu)].$$   [3.4.14]

But the term $(Y_{t-j} - \mu)$ will be a linear function of $\varepsilon_{t-j}, \varepsilon_{t-j-1}, \varepsilon_{t-j-2}, \ldots$,
which, for $j > 0$, will be uncorrelated with $\varepsilon_t$. Thus, for $j > 0$, the last term on the
right side in [3.4.14] is zero. Notice, moreover, that the expression appearing in
the first term on the right side of [3.4.14],

$$E[(Y_{t-1} - \mu)(Y_{t-j} - \mu)],$$

is the autocovariance of observations on $Y$ separated by $j - 1$ periods:

$$E[(Y_{t-1} - \mu)(Y_{[t-1]-[j-1]} - \mu)] = \gamma_{j-1}.$$

Thus, for $j > 0$, [3.4.14] becomes

$$\gamma_j = \phi \gamma_{j-1}.$$   [3.4.15]

Equation [3.4.15] takes the form of a first-order difference equation,

$$y_t = \phi y_{t-1} + w_t,$$

in which the autocovariance $\gamma$ takes the place of the variable $y$ and in which the
subscript $j$ (which indexes the order of the autocovariance) replaces $t$ (which indexes
time). The input $w_t$ in [3.4.15] is identically equal to zero. It is easy to see that the
difference equation [3.4.15] has the solution

$$\gamma_j = \phi^j \gamma_0,$$

which reproduces [3.4.6]. We now see why the impulse-response function and
autocorrelation function for an AR(1) process coincide: they both represent the
solution to a first-order difference equation with autoregressive parameter $\phi$, an
initial value of unity, and no subsequent shocks.


The Second-Order Autoregressive Process 
A second-order autoregression, denoted AR(2), satisfies

$$Y_t = c + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \varepsilon_t,$$   [3.4.16]

or, in lag operator notation,

$$(1 - \phi_1 L - \phi_2 L^2) Y_t = c + \varepsilon_t.$$   [3.4.17]

The difference equation [3.4.16] is stable provided that the roots of

$$(1 - \phi_1 z - \phi_2 z^2) = 0$$   [3.4.18]

lie outside the unit circle. When this condition is satisfied, the AR(2) process turns
out to be covariance-stationary, and the inverse of the autoregressive operator in
[3.4.17] is given by

$$\psi(L) = (1 - \phi_1 L - \phi_2 L^2)^{-1} = \psi_0 + \psi_1 L + \psi_2 L^2 + \psi_3 L^3 + \cdots.$$   [3.4.19]

Recalling [1.2.44], the value of $\psi_j$ can be found from the (1, 1) element of the
matrix $F$ raised to the $j$th power, as in expression [1.2.28]. Where the roots of
[3.4.18] are distinct, a closed-form expression for $\psi_j$ is given by [1.2.29] and [1.2.25].
Exercise 3.3 at the end of this chapter discusses alternative algorithms for calculating
$\psi_j$.

Multiplying both sides of [3.4.17] by $\psi(L)$ gives

$$Y_t = \psi(L)c + \psi(L)\varepsilon_t.$$   [3.4.20]

It is straightforward to show that

$$\psi(L)c = c/(1 - \phi_1 - \phi_2)$$   [3.4.21]

and

$$\sum_{j=0}^{\infty} |\psi_j| < \infty;$$   [3.4.22]

the reader is invited to prove these claims in Exercises 3.4 and 3.5. Since [3.4.20]
is an absolutely summable MA($\infty$) process, its mean is given by the constant term:

$$\mu = c/(1 - \phi_1 - \phi_2).$$   [3.4.23]

An alternative method for calculating the mean is to assume that the process
is covariance-stationary and take expectations of [3.4.16] directly:

$$E(Y_t) = c + \phi_1 E(Y_{t-1}) + \phi_2 E(Y_{t-2}) + E(\varepsilon_t),$$

implying

$$\mu = c + \phi_1 \mu + \phi_2 \mu + 0,$$

reproducing [3.4.23].
To find second moments, write [3.4.16] as

$$Y_t = \mu \cdot (1 - \phi_1 - \phi_2) + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \varepsilon_t$$

or

$$(Y_t - \mu) = \phi_1 (Y_{t-1} - \mu) + \phi_2 (Y_{t-2} - \mu) + \varepsilon_t.$$   [3.4.24]

Multiplying both sides of [3.4.24] by $(Y_{t-j} - \mu)$ and taking expectations produces

$$\gamma_j = \phi_1 \gamma_{j-1} + \phi_2 \gamma_{j-2} \quad \text{for } j = 1, 2, \ldots.$$   [3.4.25]

Thus, the autocovariances follow the same second-order difference equation as
does the process for $Y_t$, with the difference equation for $\gamma_j$ indexed by the lag $j$.
The autocovariances therefore behave just as the solutions to the second-order
difference equation analyzed in Section 1.2. An AR(2) process is covariance-
stationary provided that $\phi_1$ and $\phi_2$ lie within the triangular region of Figure 1.5.




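When $\phi_1$ and $\phi_2$ lie within the triangular region but above the parabola in that
figure, the autocovariance function $\gamma_j$ is the sum of two decaying exponential
functions of $j$. When $\phi_1$ and $\phi_2$ fall within the triangular region but below the
parabola, $\gamma_j$ is a damped sinusoidal function.

The autocorrelations are found by dividing both sides of [3.4.25] by $\gamma_0$:

$$\rho_j = \phi_1 \rho_{j-1} + \phi_2 \rho_{j-2} \quad \text{for } j = 1, 2, \ldots.$$   [3.4.26]

In particular, setting $j = 1$ produces

$$\rho_1 = \phi_1 + \phi_2 \rho_1$$

or

$$\rho_1 = \phi_1/(1 - \phi_2).$$   [3.4.27]

For $j = 2$,

$$\rho_2 = \phi_1 \rho_1 + \phi_2.$$   [3.4.28]

The variance of a covariance-stationary second-order autoregression can be
found by multiplying both sides of [3.4.24] by $(Y_t - \mu)$ and taking expectations:

$$E(Y_t - \mu)^2 = \phi_1 \cdot E(Y_{t-1} - \mu)(Y_t - \mu) + \phi_2 \cdot E(Y_{t-2} - \mu)(Y_t - \mu) + E(\varepsilon_t)(Y_t - \mu),$$

or

$$\gamma_0 = \phi_1 \gamma_1 + \phi_2 \gamma_2 + \sigma^2.$$   [3.4.29]

The last term ($\sigma^2$) in [3.4.29] comes from noticing that

$$E(\varepsilon_t)(Y_t - \mu) = E(\varepsilon_t)[\phi_1 (Y_{t-1} - \mu) + \phi_2 (Y_{t-2} - \mu) + \varepsilon_t]$$
$$\quad = \phi_1 \cdot 0 + \phi_2 \cdot 0 + \sigma^2.$$

Equation [3.4.29] can be written

$$\gamma_0 = \phi_1 \rho_1 \gamma_0 + \phi_2 \rho_2 \gamma_0 + \sigma^2.$$   [3.4.30]

Substituting [3.4.27] and [3.4.28] into [3.4.30] gives

$$\gamma_0 = \left[\frac{\phi_1^2}{(1 - \phi_2)} + \frac{\phi_2 \phi_1^2}{(1 - \phi_2)} + \phi_2^2\right]\gamma_0 + \sigma^2$$

or

$$\gamma_0 = \frac{(1 - \phi_2)\sigma^2}{(1 + \phi_2)[(1 - \phi_2)^2 - \phi_1^2]}.$$
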
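
The closed-form AR(2) variance can be checked against the step-by-step route through [3.4.27], [3.4.28], and [3.4.30]. The sketch below uses hypothetical coefficient values and function names. The stationarity test writes the triangular region of Figure 1.5 in its usual inequality form, which is stated here as an assumption since the text refers only to the figure.

```python
def ar2_is_stationary(phi1, phi2):
    """Triangle conditions (assumed standard form of the region in Figure 1.5)."""
    return (phi1 + phi2 < 1.0) and (phi2 - phi1 < 1.0) and (abs(phi2) < 1.0)

def ar2_variance_closed(phi1, phi2, sigma2):
    """Closed-form gamma_0 obtained by substituting [3.4.27]-[3.4.28] into [3.4.30]."""
    return (1.0 - phi2) * sigma2 / ((1.0 + phi2) * ((1.0 - phi2) ** 2 - phi1 ** 2))

def ar2_variance_stepwise(phi1, phi2, sigma2):
    """gamma_0 computed by solving [3.4.30] after evaluating rho_1 and rho_2."""
    rho1 = phi1 / (1.0 - phi2)      # [3.4.27]
    rho2 = phi1 * rho1 + phi2       # [3.4.28]
    return sigma2 / (1.0 - phi1 * rho1 - phi2 * rho2)

phi1, phi2, sigma2 = 0.6, 0.2, 1.0   # hypothetical values inside the triangle
closed = ar2_variance_closed(phi1, phi2, sigma2)
stepwise = ar2_variance_stepwise(phi1, phi2, sigma2)
print(ar2_is_stationary(phi1, phi2), closed, stepwise)
```

Both routes should give the same number whenever the coefficients lie inside the stationary region; outside it the formulas can still be evaluated but no longer describe a variance.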
The pth-Order Autoregressive Process 
A $p$th-order autoregression, denoted AR($p$), satisfies

$$Y_t = c + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \cdots + \phi_p Y_{t-p} + \varepsilon_t.$$   [3.4.31]

Provided that the roots of

$$1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_p z^p = 0$$   [3.4.32]

all lie outside the unit circle, it is straightforward to verify that a covariance-
stationary representation of the form

$$Y_t = \mu + \psi(L)\varepsilon_t$$   [3.4.33]

exists where

$$\psi(L) = (1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p)^{-1}$$

and $\sum_{j=0}^{\infty} |\psi_j| < \infty$. Assuming that the stationarity condition is satisfied, one way
to find the mean is to take expectations of [3.4.31]:

$$\mu = c + \phi_1 \mu + \phi_2 \mu + \cdots + \phi_p \mu,$$

or

$$\mu = c/(1 - \phi_1 - \phi_2 - \cdots - \phi_p).$$   [3.4.34]

Using [3.4.34], equation [3.4.31] can be written

$$Y_t - \mu = \phi_1 (Y_{t-1} - \mu) + \phi_2 (Y_{t-2} - \mu) + \cdots + \phi_p (Y_{t-p} - \mu) + \varepsilon_t.$$   [3.4.35]

Autocovariances are found by multiplying both sides of [3.4.35] by $(Y_{t-j} - \mu)$ and
taking expectations:

$$\gamma_j = \begin{cases} \phi_1 \gamma_{j-1} + \phi_2 \gamma_{j-2} + \cdots + \phi_p \gamma_{j-p} & \text{for } j = 1, 2, \ldots \\ \phi_1 \gamma_1 + \phi_2 \gamma_2 + \cdots + \phi_p \gamma_p + \sigma^2 & \text{for } j = 0. \end{cases}$$   [3.4.36]

Using the fact that $\gamma_{-j} = \gamma_j$, the system of equations in [3.4.36] for $j = 0, 1,
\ldots, p$ can be solved for $\gamma_0, \gamma_1, \ldots, \gamma_p$ as functions of $\sigma^2, \phi_1, \phi_2, \ldots, \phi_p$. It
can be shown⁴ that the $(p \times 1)$ vector $(\gamma_0, \gamma_1, \ldots, \gamma_{p-1})'$ is given by the first $p$
elements of the first column of the $(p^2 \times p^2)$ matrix $\sigma^2 [I_{p^2} - (F \otimes F)]^{-1}$ where
$F$ is the $(p \times p)$ matrix defined in equation [1.2.3] and $\otimes$ indicates the Kronecker
product.
Dividing [3.4.36] by $\gamma_0$ produces the Yule-Walker equations:

$$\rho_j = \phi_1 \rho_{j-1} + \phi_2 \rho_{j-2} + \cdots + \phi_p \rho_{j-p} \quad \text{for } j = 1, 2, \ldots.$$   [3.4.37]

Thus, the autocovariances and autocorrelations follow the same $p$th-order
difference equation as does the process itself [3.4.31]. For distinct roots, their
solutions take the form

$$\gamma_j = g_1 \lambda_1^j + g_2 \lambda_2^j + \cdots + g_p \lambda_p^j,$$   [3.4.38]

where the eigenvalues $(\lambda_1, \ldots, \lambda_p)$ are the solutions to

$$\lambda^p - \phi_1 \lambda^{p-1} - \phi_2 \lambda^{p-2} - \cdots - \phi_p = 0.$$

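
One way to see the Yule-Walker equations at work is to build the moving average weights of a stationary autoregression recursively, form autocovariances from truncated sums as in [3.3.18], and confirm that the resulting autocorrelations satisfy [3.4.37]. This is a sketch under assumptions: the AR(3) coefficients are hypothetical (their absolute values sum to less than one, which is sufficient for the roots of [3.4.32] to lie outside the unit circle), and the truncation length and function names are illustrative.

```python
def ar_psi(phis, n):
    """First n weights of the inverse of (1 - phi_1 L - ... - phi_p L^p)."""
    psi = [1.0]
    for j in range(1, n):
        psi.append(sum(phis[i] * psi[j - 1 - i]
                       for i in range(len(phis)) if j - 1 - i >= 0))
    return psi

def autocov_from_psi(psi, sigma2, max_lag):
    """Truncated version of gamma_j = sigma2 * sum_k psi_{j+k} * psi_k."""
    n = len(psi)
    return [sigma2 * sum(psi[j + k] * psi[k] for k in range(n - j))
            for j in range(max_lag + 1)]

phis = [0.5, -0.3, 0.1]            # hypothetical AR(3) coefficients
psi = ar_psi(phis, 400)
gamma = autocov_from_psi(psi, 1.0, 6)
rho = [g / gamma[0] for g in gamma]

# Check rho_j = phi_1 rho_{j-1} + ... + phi_p rho_{j-p}, using rho_{-k} = rho_k.
max_gap = 0.0
for j in range(1, 7):
    rhs = sum(phis[i] * rho[abs(j - 1 - i)] for i in range(len(phis)))
    max_gap = max(max_gap, abs(rho[j] - rhs))
print("largest Yule-Walker discrepancy:", max_gap)
```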


3.5. Mixed Autoregressive Moving Average Processes 
An ARMA($p$, $q$) process includes both autoregressive and moving average terms:

$$Y_t = c + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \cdots + \phi_p Y_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \cdots + \theta_q \varepsilon_{t-q},$$   [3.5.1]

or, in lag operator form,

$$(1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p) Y_t = c + (1 + \theta_1 L + \theta_2 L^2 + \cdots + \theta_q L^q)\varepsilon_t.$$   [3.5.2]

Provided that the roots of

$$1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_p z^p = 0$$   [3.5.3]

lie outside the unit circle, both sides of [3.5.2] can be divided by
$(1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p)$ to obtain

$$Y_t = \mu + \psi(L)\varepsilon_t$$

where

$$\psi(L) = \frac{(1 + \theta_1 L + \theta_2 L^2 + \cdots + \theta_q L^q)}{(1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p)}$$
$$\sum_{j=0}^{\infty} |\psi_j| < \infty$$
$$\mu = c/(1 - \phi_1 - \phi_2 - \cdots - \phi_p).$$

Thus, stationarity of an ARMA process depends entirely on the autoregressive
parameters $(\phi_1, \phi_2, \ldots, \phi_p)$ and not on the moving average parameters
$(\theta_1, \theta_2, \ldots, \theta_q)$.

⁴The reader will be invited to prove this in Exercise 10.1 in Chapter 10.


It is often convenient to write the ARMA process [3.5.1] in terms of deviations
from the mean:

$$Y_t - \mu = \phi_1 (Y_{t-1} - \mu) + \phi_2 (Y_{t-2} - \mu) + \cdots + \phi_p (Y_{t-p} - \mu) + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \cdots + \theta_q \varepsilon_{t-q}.$$   [3.5.4]

Autocovariances are found by multiplying both sides of [3.5.4] by $(Y_{t-j} - \mu)$ and
taking expectations. For $j > q$, the resulting equations take the form

$$\gamma_j = \phi_1 \gamma_{j-1} + \phi_2 \gamma_{j-2} + \cdots + \phi_p \gamma_{j-p} \quad \text{for } j = q + 1, q + 2, \ldots.$$   [3.5.5]

Thus, after $q$ lags the autocovariance function $\gamma_j$ (and the autocorrelation function
$\rho_j$) follow the $p$th-order difference equation governed by the autoregressive
parameters.

Note that [3.5.5] does not hold for $j \leq q$, owing to correlation between $\theta_j \varepsilon_{t-j}$
and $Y_{t-j}$. Hence, an ARMA($p$, $q$) process will have more complicated
autocovariances for lags 1 through $q$ than would the corresponding AR($p$) process. For
$j > q$ with distinct autoregressive roots, the autocovariances will be given by

$$\gamma_j = h_1 \lambda_1^j + h_2 \lambda_2^j + \cdots + h_p \lambda_p^j.$$   [3.5.6]

This takes the same form as the autocovariances for an AR($p$) process [3.4.38],
though because the initial conditions $(\gamma_q, \gamma_{q-1}, \ldots, \gamma_{q-p+1})$ differ for the ARMA and
AR processes, the parameters $h_k$ in [3.5.6] will not be the same as the parameters
$g_k$ in [3.4.38].

There is a potential for redundant parameterization with ARMA processes.
Consider, for example, a simple white noise process,

$$Y_t = \varepsilon_t.$$   [3.5.7]

Suppose both sides of [3.5.7] are multiplied by $(1 - \rho L)$:

$$(1 - \rho L) Y_t = (1 - \rho L)\varepsilon_t.$$   [3.5.8]

Clearly, if [3.5.7] is a valid representation, then so is [3.5.8] for any value of $\rho$.
Thus, [3.5.8] might be described as an ARMA(1, 1) process, with $\phi_1 = \rho$ and
$\theta_1 = -\rho$. It is important to avoid such a parameterization. Since any value of $\rho$
in [3.5.8] describes the data equally well, we will obviously get into trouble trying
to estimate the parameter $\rho$ in [3.5.8] by maximum likelihood. Moreover,
theoretical manipulations based on a representation such as [3.5.8] may overlook key
cancellations. If we are using an ARMA(1, 1) model in which $\theta_1$ is close to $-\phi_1$,
then the data might better be modeled as simple white noise.
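
The cancellation in [3.5.8] shows up clearly in the moving average weights implied by an ARMA model. The sketch below computes those weights from the identity that the autoregressive polynomial times the weight polynomial equals the moving average polynomial; the function name `arma_psi` and the parameter values are assumptions chosen for illustration.

```python
def arma_psi(phis, thetas, n):
    """First n weights psi_j solving phi(L) * psi(L) = theta(L)."""
    psi = []
    for j in range(n):
        if j == 0:
            val = 1.0
        elif j <= len(thetas):
            val = thetas[j - 1]
        else:
            val = 0.0
        # add phi_1 * psi_{j-1} + ... + phi_p * psi_{j-p} where defined
        val += sum(phis[i] * psi[j - 1 - i]
                   for i in range(len(phis)) if j - 1 - i >= 0)
        psi.append(val)
    return psi

# Exact cancellation: ARMA(1, 1) with phi_1 = 0.7 and theta_1 = -0.7
cancelled = arma_psi([0.7], [-0.7], 6)
# Near cancellation: theta_1 close to -phi_1
near = arma_psi([0.7], [-0.69], 6)
print(cancelled)
print(near)
```

With exact cancellation every weight after the first is zero, so the process is white noise; with near cancellation the remaining weights are small, which is the situation the paragraph above warns about.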




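A related overparameterization can arise with an ARMA($p$, $q$) model.
Consider factoring the lag polynomial operators in [3.5.2] as in [2.4.3]:

$$(1 - \lambda_1 L)(1 - \lambda_2 L) \cdots (1 - \lambda_p L)(Y_t - \mu) = (1 - \eta_1 L)(1 - \eta_2 L) \cdots (1 - \eta_q L)\varepsilon_t.$$   [3.5.9]

We assume that $|\lambda_i| < 1$ for all $i$, so that the process is covariance-stationary. If
the autoregressive operator $(1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p)$ and the moving
average operator $(1 + \theta_1 L + \theta_2 L^2 + \cdots + \theta_q L^q)$ have any roots in common,
say, $\lambda_i = \eta_j$ for some $i$ and $j$, then both sides of [3.5.9] can be divided by
$(1 - \lambda_i L)$:

$$\prod_{k=1, k \neq i}^{p} (1 - \lambda_k L)(Y_t - \mu) = \prod_{k=1, k \neq j}^{q} (1 - \eta_k L)\varepsilon_t,$$

or

$$(1 - \phi_1^* L - \phi_2^* L^2 - \cdots - \phi_{p-1}^* L^{p-1})(Y_t - \mu) = (1 + \theta_1^* L + \theta_2^* L^2 + \cdots + \theta_{q-1}^* L^{q-1})\varepsilon_t,$$   [3.5.10]

where

$$(1 - \phi_1^* L - \phi_2^* L^2 - \cdots - \phi_{p-1}^* L^{p-1}) = (1 - \lambda_1 L)(1 - \lambda_2 L) \cdots (1 - \lambda_{i-1} L)(1 - \lambda_{i+1} L) \cdots (1 - \lambda_p L)$$
$$(1 + \theta_1^* L + \theta_2^* L^2 + \cdots + \theta_{q-1}^* L^{q-1}) = (1 - \eta_1 L)(1 - \eta_2 L) \cdots (1 - \eta_{j-1} L)(1 - \eta_{j+1} L) \cdots (1 - \eta_q L).$$

The stationary ARMA($p$, $q$) process satisfying [3.5.2] is clearly identical to the
stationary ARMA($p - 1$, $q - 1$) process satisfying [3.5.10].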


3.6. The Autocovariance-Generating Function 


For each of the covariance-stationary processes for $Y_t$ considered so far, we
calculated the sequence of autocovariances $\{\gamma_j\}_{j=-\infty}^{\infty}$. If this sequence is absolutely
summable, then one way of summarizing the autocovariances is through a scalar-
valued function called the autocovariance-generating function:

$$g_Y(z) = \sum_{j=-\infty}^{\infty} \gamma_j z^j.$$   [3.6.1]

This function is constructed by taking the $j$th autocovariance and multiplying it by
some number $z$ raised to the $j$th power, and then summing over all the possible
values of $j$. The argument of this function ($z$) is taken to be a complex scalar.

Of particular interest as an argument for the autocovariance-generating
function is any value of $z$ that lies on the complex unit circle,

$$z = \cos(\omega) - i \sin(\omega) = e^{-i\omega},$$

where $i = \sqrt{-1}$ and $\omega$ is the radian angle that $z$ makes with the real axis. If the
autocovariance-generating function is evaluated at $z = e^{-i\omega}$ and divided by $2\pi$,
the resulting function of $\omega$,

$$s_Y(\omega) = \frac{1}{2\pi} g_Y(e^{-i\omega}) = \frac{1}{2\pi} \sum_{j=-\infty}^{\infty} \gamma_j e^{-i\omega j},$$

is called the population spectrum of $Y$. The population spectrum will be discussed
in detail in Chapter 6. There it will be shown that for a process with absolutely
summable autocovariances, the function $s_Y(\omega)$ exists and can be used to calculate
all of the autocovariances. This means that if two different processes share the
same autocovariance-generating function, then the two processes exhibit the
identical sequence of autocovariances.

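As an example of calculating an autocovariance-generating function, consider
the MA(1) process. From equations [3.3.3] to [3.3.5], its autocovariance-generating
function is

$$g_Y(z) = [\theta\sigma^2] z^{-1} + [(1 + \theta^2)\sigma^2] z^0 + [\theta\sigma^2] z^1 = \sigma^2 \cdot [\theta z^{-1} + (1 + \theta^2) + \theta z].$$

Notice that this expression could alternatively be written

$$g_Y(z) = \sigma^2 (1 + \theta z)(1 + \theta z^{-1}).$$   [3.6.2]

The form of expression [3.6.2] suggests that for the MA($q$) process,

$$Y_t = \mu + (1 + \theta_1 L + \theta_2 L^2 + \cdots + \theta_q L^q)\varepsilon_t,$$

the autocovariance-generating function might be calculated as

$$g_Y(z) = \sigma^2 (1 + \theta_1 z + \theta_2 z^2 + \cdots + \theta_q z^q) \times (1 + \theta_1 z^{-1} + \theta_2 z^{-2} + \cdots + \theta_q z^{-q}).$$   [3.6.3]

This conjecture can be verified by carrying out the multiplication in [3.6.3] and
collecting terms by powers of $z$:

$$(1 + \theta_1 z + \theta_2 z^2 + \cdots + \theta_q z^q) \times (1 + \theta_1 z^{-1} + \theta_2 z^{-2} + \cdots + \theta_q z^{-q})$$
$$\quad = (\theta_q) z^q + (\theta_{q-1} + \theta_q \theta_1) z^{(q-1)} + (\theta_{q-2} + \theta_{q-1}\theta_1 + \theta_q \theta_2) z^{(q-2)}$$
$$\qquad + \cdots + (\theta_1 + \theta_2 \theta_1 + \theta_3 \theta_2 + \cdots + \theta_q \theta_{q-1}) z^1$$   [3.6.4]
$$\qquad + (1 + \theta_1^2 + \theta_2^2 + \cdots + \theta_q^2) z^0$$
$$\qquad + (\theta_1 + \theta_2 \theta_1 + \theta_3 \theta_2 + \cdots + \theta_q \theta_{q-1}) z^{-1} + \cdots + (\theta_q) z^{-q}.$$

Comparison of [3.6.4] with [3.3.10] or [3.3.12] confirms that the coefficient on $z^j$
in [3.6.3] is indeed the $j$th autocovariance.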
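
The multiplication in [3.6.3] can be carried out mechanically by treating the two factors as coefficient lists. The following sketch returns the coefficient on each power of the argument, keyed by that power; the function name `acgf_coeffs` and the MA(2) parameter values are assumptions for illustration.

```python
def acgf_coeffs(thetas, sigma2):
    """Coefficients of sigma2 * theta(z) * theta(1/z), keyed by the power of z."""
    a = [1.0] + list(thetas)      # coefficients of theta(z), theta_0 = 1
    q = len(thetas)
    rev = a[::-1]                 # z**q * theta(1/z) as an ordinary polynomial
    prod = [0.0] * (2 * q + 1)
    for i, x in enumerate(a):
        for k, y in enumerate(rev):
            prod[i + k] += x * y
    # prod[m] multiplies z**(m - q) once the factor z**(-q) is restored
    return {m - q: sigma2 * c for m, c in enumerate(prod)}

# Hypothetical MA(2): theta_1 = 0.5, theta_2 = -0.3, sigma^2 = 2
coeffs = acgf_coeffs([0.5, -0.3], 2.0)
for power in sorted(coeffs):
    print(f"coefficient on z^{power:+d}: {coeffs[power]:.4f}")
```

The coefficient on the zeroth power should match the variance in [3.3.10], and the coefficients on positive and negative powers of equal size should match each other and the autocovariances in [3.3.12].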
This method for finding $g_Y(z)$ extends to the MA($\infty$) case. If

$$Y_t = \mu + \psi(L)\varepsilon_t$$   [3.6.5]

with

$$\psi(L) = \psi_0 + \psi_1 L + \psi_2 L^2 + \cdots$$   [3.6.6]

and

$$\sum_{j=0}^{\infty} |\psi_j| < \infty,$$   [3.6.7]

then

$$g_Y(z) = \sigma^2 \psi(z)\psi(z^{-1}).$$   [3.6.8]

For example, the stationary AR(1) process can be written as

$$Y_t - \mu = (1 - \phi L)^{-1}\varepsilon_t,$$

which is in the form of [3.6.5] with $\psi(L) = 1/(1 - \phi L)$. The autocovariance-
generating function for an AR(1) process could therefore be calculated from

$$g_Y(z) = \frac{\sigma^2}{(1 - \phi z)(1 - \phi z^{-1})}.$$   [3.6.9]

To verify this claim directly, expand out the terms in [3.6.9]:

$$\frac{\sigma^2}{(1 - \phi z)(1 - \phi z^{-1})} = \sigma^2 (1 + \phi z + \phi^2 z^2 + \phi^3 z^3 + \cdots) \times (1 + \phi z^{-1} + \phi^2 z^{-2} + \phi^3 z^{-3} + \cdots),$$

from which the coefficient on $z^j$ is

$$\sigma^2 (\phi^j + \phi^{j+1}\phi + \phi^{j+2}\phi^2 + \cdots) = \sigma^2 \phi^j/(1 - \phi^2).$$

This indeed yields the $j$th autocovariance as earlier calculated in equation [3.4.5].
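The autocovariance-generating function for a stationary ARMA($p$, $q$) process
can be written

$$g_Y(z) = \frac{\sigma^2 (1 + \theta_1 z + \theta_2 z^2 + \cdots + \theta_q z^q)(1 + \theta_1 z^{-1} + \theta_2 z^{-2} + \cdots + \theta_q z^{-q})}{(1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_p z^p)(1 - \phi_1 z^{-1} - \phi_2 z^{-2} - \cdots - \phi_p z^{-p})}.$$   [3.6.10]
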
Filters 


Sometimes the data are filtered, or treated in a particular way before they are analyzed, and we would like to summarize the effects of this treatment on the autocovariances. This calculation is particularly simple using the autocovariance-generating function. For example, suppose that the original data $Y_t$ were generated from an MA(1) process,
$$Y_t = (1 + \theta L)\varepsilon_t, \qquad [3.6.11]$$
with autocovariance-generating function given by [3.6.2]. Let's say that the data as actually analyzed, $X_t$, represent the change in $Y_t$ over its value the previous period:
$$X_t = Y_t - Y_{t-1} = (1 - L)Y_t. \qquad [3.6.12]$$
Substituting [3.6.11] into [3.6.12], the observed data can be characterized as the following MA(2) process,
$$X_t = (1 - L)(1 + \theta L)\varepsilon_t = [1 + (\theta - 1)L - \theta L^2]\varepsilon_t \equiv [1 + \theta_1 L + \theta_2 L^2]\varepsilon_t, \qquad [3.6.13]$$
with $\theta_1 \equiv (\theta - 1)$ and $\theta_2 \equiv -\theta$. The autocovariance-generating function of the observed data $X_t$ can be calculated by direct application of [3.6.3]:
$$g_X(z) = \sigma^2(1 + \theta_1 z + \theta_2 z^2)(1 + \theta_1 z^{-1} + \theta_2 z^{-2}). \qquad [3.6.14]$$


It is often instructive, however, to keep the polynomial $(1 + \theta_1 z + \theta_2 z^2)$ in its factored form of the first line of [3.6.13],
$$(1 + \theta_1 z + \theta_2 z^2) = (1 - z)(1 + \theta z),$$
in which case [3.6.14] could be written
$$g_X(z) = \sigma^2(1 - z)(1 + \theta z)(1 - z^{-1})(1 + \theta z^{-1}) = (1 - z)(1 - z^{-1})\cdot g_Y(z). \qquad [3.6.15]$$
Of course, [3.6.14] and [3.6.15] represent the identical function of $z$, and which way we choose to write it is simply a matter of convenience. Applying the filter $(1 - L)$ to $Y_t$ thus results in multiplying its autocovariance-generating function by $(1 - z)(1 - z^{-1})$.

This principle readily generalizes. Suppose that the original data series $\{Y_t\}$ satisfies [3.6.5] through [3.6.7]. Let's say the data are filtered according to
$$X_t = h(L)Y_t \qquad [3.6.16]$$
with
$$h(L) = \sum_{j=-\infty}^{\infty} h_j L^j$$
and absolutely summable coefficients, $\sum_{j=-\infty}^{\infty} |h_j| < \infty$. Substituting [3.6.5] into [3.6.16], the observed data $X_t$ are then generated by
$$X_t = h(1)\mu + h(L)\psi(L)\varepsilon_t \equiv \mu^* + \psi^*(L)\varepsilon_t,$$
where $\mu^* \equiv h(1)\mu$ and $\psi^*(L) \equiv h(L)\psi(L)$. The sequence of coefficients associated with the compound operator, $\{\psi_j^*\}_{j=-\infty}^{\infty}$, turns out to be absolutely summable,⁵ and the autocovariance-generating function of $X_t$ can accordingly be calculated as
$$g_X(z) = \sigma^2\psi^*(z)\psi^*(z^{-1}) = \sigma^2 h(z)\psi(z)\psi(z^{-1})h(z^{-1}) = h(z)h(z^{-1})g_Y(z). \qquad [3.6.17]$$
Applying the filter $h(L)$ to a series thus results in multiplying its autocovariance-generating function by $h(z)h(z^{-1})$.
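The filtering result can be illustrated for the first-difference example with a short computation. This is a sketch under assumed parameter values, with illustrative function and variable names; it computes the autocovariances of $X_t$ in two ways: from the compound operator $h(L)(1 + \theta L)$, and by multiplying the coefficients of $g_Y(z)$ by those of $h(z)h(z^{-1})$.

```python
import numpy as np

def autocov_from_ma(coefs, sigma2):
    """Autocovariances gamma_0..gamma_q of sum_k coefs[k] * eps_{t-k}."""
    c = np.asarray(coefs, dtype=float)
    q = len(c) - 1
    return sigma2 * np.convolve(c, c[::-1])[q:]

theta, sigma2 = 0.6, 1.0                 # illustrative MA(1) parameters
ma1 = np.array([1.0, theta])             # 1 + theta L
h = np.array([1.0, -1.0])                # first-difference filter 1 - L

# Route 1: form the compound operator h(L)(1 + theta L), then read off autocovariances.
gamma_x = autocov_from_ma(np.convolve(h, ma1), sigma2)

# Route 2: multiply g_Y(z) by h(z) h(1/z), using two-sided coefficient arrays.
g_y = sigma2 * np.convolve(ma1, ma1[::-1])   # coefficients on z^-1, z^0, z^1
hh = np.convolve(h, h[::-1])                 # coefficients on z^-1, z^0, z^1
g_x = np.convolve(hh, g_y)                   # coefficients on z^-2 .. z^2
gamma_x_filter = g_x[2:]                     # keep z^0, z^1, z^2

print(gamma_x, gamma_x_filter)
```

The second route never forms the MA(2) coefficients of $X_t$; it only rescales the autocovariance-generating function of $Y_t$, which is the convenience that [3.6.17] describes.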
3.7. Invertibility 


Invertibility for the MA(1) Process 
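Consider an MA(1) process,
$$Y_t - \mu = (1 + \theta L)\varepsilon_t, \qquad [3.7.1]$$
with
$$E(\varepsilon_t\varepsilon_\tau) = \begin{cases} \sigma^2 & \text{for } t = \tau \\ 0 & \text{otherwise.} \end{cases}$$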

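⁵Specifically,
$$\begin{aligned}
\psi^*(z) &= \left(\sum_{j=-\infty}^{\infty} h_j z^j\right)\left(\sum_{k=0}^{\infty} \psi_k z^k\right) \\
&= (\cdots + h_{-j}z^{-j} + h_{-j+1}z^{-j+1} + \cdots + h_{-1}z^{-1} + h_0 z^0 + h_1 z^1 + \cdots + h_j z^j + h_{j+1}z^{j+1} + \cdots)(\psi_0 z^0 + \psi_1 z^1 + \psi_2 z^2 + \cdots),
\end{aligned}$$
from which the coefficient on $z^j$ is
$$\psi_j^* = h_j\psi_0 + h_{j-1}\psi_1 + h_{j-2}\psi_2 + \cdots = \sum_{\nu=0}^{\infty} h_{j-\nu}\psi_\nu.$$
Then
$$\sum_{j=-\infty}^{\infty} |\psi_j^*| = \sum_{j=-\infty}^{\infty}\left|\sum_{\nu=0}^{\infty} h_{j-\nu}\psi_\nu\right| \le \sum_{j=-\infty}^{\infty}\sum_{\nu=0}^{\infty} |h_{j-\nu}\psi_\nu| = \sum_{\nu=0}^{\infty} |\psi_\nu| \sum_{j=-\infty}^{\infty} |h_{j-\nu}| = \sum_{\nu=0}^{\infty} |\psi_\nu| \sum_{j=-\infty}^{\infty} |h_j| < \infty.$$
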


Provided that $|\theta| < 1$, both sides of [3.7.1] can be multiplied by $(1 + \theta L)^{-1}$ to obtain⁶
$$(1 - \theta L + \theta^2 L^2 - \theta^3 L^3 + \cdots)(Y_t - \mu) = \varepsilon_t, \qquad [3.7.2]$$
which could be viewed as an AR($\infty$) representation. If a moving average representation such as [3.7.1] can be rewritten as an AR($\infty$) representation such as [3.7.2] simply by inverting the moving average operator $(1 + \theta L)$, then the moving average representation is said to be invertible. For an MA(1) process, invertibility requires $|\theta| < 1$; if $|\theta| \ge 1$, then the infinite sequence in [3.7.2] would not be well defined.

Let us investigate what invertibility means in terms of the first and second moments of the process. Recall that the MA(1) process [3.7.1] has mean $\mu$ and autocovariance-generating function
$$g_Y(z) = \sigma^2(1 + \theta z)(1 + \theta z^{-1}). \qquad [3.7.3]$$
Now consider a seemingly different MA(1) process,
$$\tilde{Y}_t - \mu = (1 + \tilde{\theta}L)\tilde{\varepsilon}_t, \qquad [3.7.4]$$
with
$$E(\tilde{\varepsilon}_t\tilde{\varepsilon}_\tau) = \begin{cases} \tilde{\sigma}^2 & \text{for } t = \tau \\ 0 & \text{otherwise.} \end{cases}$$
Note that $\tilde{Y}_t$ has the same mean ($\mu$) as $Y_t$. Its autocovariance-generating function is
$$\begin{aligned}
g_{\tilde{Y}}(z) &= \tilde{\sigma}^2(1 + \tilde{\theta}z)(1 + \tilde{\theta}z^{-1}) \\
&= \tilde{\sigma}^2\{(\tilde{\theta}^{-1}z^{-1} + 1)(\tilde{\theta}z)\}\{(\tilde{\theta}^{-1}z + 1)(\tilde{\theta}z^{-1})\} \\
&= (\tilde{\sigma}^2\tilde{\theta}^2)(1 + \tilde{\theta}^{-1}z)(1 + \tilde{\theta}^{-1}z^{-1}).
\end{aligned} \qquad [3.7.5]$$
Suppose that the parameters of [3.7.4], $(\tilde{\theta}, \tilde{\sigma}^2)$, are related to those of [3.7.1] by the following equations:
$$\theta = \tilde{\theta}^{-1} \qquad [3.7.6]$$
$$\sigma^2 = \tilde{\theta}^2\tilde{\sigma}^2. \qquad [3.7.7]$$
Then the autocovariance-generating functions [3.7.3] and [3.7.5] would be the same, meaning that $Y_t$ and $\tilde{Y}_t$ would have identical first and second moments.

Notice from [3.7.6] that if $|\theta| < 1$, then $|\tilde{\theta}| > 1$. In other words, for any invertible MA(1) representation [3.7.1], we have found a noninvertible MA(1) representation [3.7.4] with the same first and second moments as the invertible representation. Conversely, given any noninvertible representation with $|\tilde{\theta}| > 1$, there exists an invertible representation with $\theta = (1/\tilde{\theta})$ that has the same first and second moments as the noninvertible representation. In the borderline case where $\theta = \pm 1$, there is only one representation of the process, and it is noninvertible.
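A short numerical check of [3.7.6] and [3.7.7] follows. It is a sketch with assumed parameter values and an illustrative helper name: it evaluates $\gamma_0 = (1 + \theta^2)\sigma^2$ and $\gamma_1 = \theta\sigma^2$ for a noninvertible parameter pair and for the matched invertible pair.

```python
import numpy as np

def ma1_autocov(theta, sigma2):
    """(gamma_0, gamma_1) of Y_t - mu = (1 + theta L) eps_t."""
    return np.array([(1.0 + theta ** 2) * sigma2, theta * sigma2])

theta_tilde, sigma2_tilde = 2.5, 0.8           # noninvertible parameters (illustrative)
theta = 1.0 / theta_tilde                       # [3.7.6]
sigma2 = theta_tilde ** 2 * sigma2_tilde        # [3.7.7]

gamma_noninvertible = ma1_autocov(theta_tilde, sigma2_tilde)
gamma_invertible = ma1_autocov(theta, sigma2)
print(gamma_noninvertible, gamma_invertible)
```

Both parameter pairs produce the same $(\gamma_0, \gamma_1)$, so first and second moments alone cannot distinguish the two representations.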

Not only do the invertible and noninvertible representations share the same moments; either representation [3.7.1] or [3.7.4] could be used as an equally valid description of any given MA(1) process! Suppose a computer generated an infinite sequence of $\tilde{Y}$'s according to [3.7.4] with $\tilde{\theta} > 1$. Thus we know for a fact that the data were generated from an MA(1) process expressed in terms of a noninvertible representation. In what sense could these same data be associated with an invertible MA(1) representation?


⁶Note from [2.2.8] that
$$(1 + \theta L)^{-1} = [1 - (-\theta)L]^{-1} = 1 + (-\theta)L + (-\theta)^2 L^2 + (-\theta)^3 L^3 + \cdots.$$



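Imagine calculating a series $\{\varepsilon_t\}_{t=-\infty}^{\infty}$ defined by
$$\begin{aligned}
\varepsilon_t &\equiv (1 + \theta L)^{-1}(\tilde{Y}_t - \mu) \\
&= (\tilde{Y}_t - \mu) - \theta(\tilde{Y}_{t-1} - \mu) + \theta^2(\tilde{Y}_{t-2} - \mu) - \theta^3(\tilde{Y}_{t-3} - \mu) + \cdots,
\end{aligned} \qquad [3.7.8]$$
where $\theta \equiv (1/\tilde{\theta})$ is the moving average parameter associated with the invertible MA(1) representation that shares the same moments as [3.7.4]. Note that since $|\theta| < 1$, this produces a well-defined, mean square convergent series $\{\varepsilon_t\}$.

Furthermore, the sequence $\{\varepsilon_t\}$ so generated is white noise. The simplest way to verify this is to calculate the autocovariance-generating function of $\varepsilon_t$ and confirm that the coefficient on $z^j$ (the $j$th autocovariance) is equal to zero for any $j \ne 0$. From [3.7.8] and [3.6.17], the autocovariance-generating function for $\varepsilon_t$ is given by
$$g_\varepsilon(z) = (1 + \theta z)^{-1}(1 + \theta z^{-1})^{-1}g_{\tilde{Y}}(z). \qquad [3.7.9]$$
Substituting [3.7.5] into [3.7.9],
$$\begin{aligned}
g_\varepsilon(z) &= (1 + \theta z)^{-1}(1 + \theta z^{-1})^{-1}(\tilde{\sigma}^2\tilde{\theta}^2)(1 + \tilde{\theta}^{-1}z)(1 + \tilde{\theta}^{-1}z^{-1}) \\
&= \tilde{\sigma}^2\tilde{\theta}^2,
\end{aligned} \qquad [3.7.10]$$
where the last equality follows from the fact that $\tilde{\theta}^{-1} = \theta$. Since the autocovariance-generating function is a constant, it follows that $\varepsilon_t$ is a white noise process with variance $\tilde{\theta}^2\tilde{\sigma}^2$.

Multiplying both sides of [3.7.8] by $(1 + \theta L)$,
$$\tilde{Y}_t - \mu = (1 + \theta L)\varepsilon_t$$
is a perfectly valid invertible MA(1) representation of data that were actually generated from the noninvertible representation [3.7.4].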
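The construction in [3.7.8] can be tried on simulated data. The sketch below makes several assumptions that are not in the text: Gaussian innovations, a finite sample, specific parameter values, and a zero starting value for the recursion (whose effect decays like $\theta^t$). It generates data from a noninvertible MA(1) and recovers the innovations of the invertible representation, whose sample variance should be near $\tilde{\theta}^2\tilde{\sigma}^2$ and whose first-order autocorrelation should be near zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, mu = 200_000, 1.0
theta_tilde, sigma_tilde = 2.0, 1.0            # noninvertible representation [3.7.4]
theta = 1.0 / theta_tilde                       # invertible counterpart

eps_tilde = rng.normal(0.0, sigma_tilde, n + 1)
y = mu + eps_tilde[1:] + theta_tilde * eps_tilde[:-1]

# [3.7.8] is equivalent to the recursion eps_t = (y_t - mu) - theta * eps_{t-1}.
# Starting from eps = 0 introduces an error that decays like theta^t.
eps = np.empty(n)
prev = 0.0
for t in range(n):
    prev = (y[t] - mu) - theta * prev
    eps[t] = prev

eps = eps[100:]                                 # discard the start-up transient
sample_var = eps.var()
lag1_corr = np.corrcoef(eps[1:], eps[:-1])[0, 1]
print(sample_var, lag1_corr)
```

Note that the recursion uses only current and past values of the data, which is what makes the invertible representation computable in practice.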

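The converse proposition is also true: suppose that the data were really generated from [3.7.1] with $|\theta| < 1$, an invertible representation. Then there exists a noninvertible representation with $\tilde{\theta} = 1/\theta$ that describes these data with equal validity. To characterize this noninvertible representation, consider the operator proposed in [2.5.20] as the appropriate inverse of $(1 + \tilde{\theta}L)$:
$$\begin{aligned}
&(\tilde{\theta})^{-1}L^{-1}[1 - (\tilde{\theta}^{-1})L^{-1} + (\tilde{\theta}^{-2})L^{-2} - (\tilde{\theta}^{-3})L^{-3} + \cdots] \\
&\quad = \theta L^{-1}[1 - \theta L^{-1} + \theta^2 L^{-2} - \theta^3 L^{-3} + \cdots].
\end{aligned}$$
Define $\tilde{\varepsilon}_t$ to be the series that results from applying this operator to $(Y_t - \mu)$,
$$\tilde{\varepsilon}_t \equiv \theta(Y_{t+1} - \mu) - \theta^2(Y_{t+2} - \mu) + \theta^3(Y_{t+3} - \mu) - \cdots, \qquad [3.7.11]$$
noting that this series converges for $|\theta| < 1$. Again this series is white noise:
$$\begin{aligned}
g_{\tilde{\varepsilon}}(z) &= \{\theta z^{-1}[1 - \theta z^{-1} + \theta^2 z^{-2} - \theta^3 z^{-3} + \cdots]\} \\
&\quad \times \{\theta z[1 - \theta z + \theta^2 z^2 - \theta^3 z^3 + \cdots]\}\sigma^2(1 + \theta z)(1 + \theta z^{-1}) \\
&= \theta^2\sigma^2.
\end{aligned}$$
The coefficient on $z^j$ is zero for $j \ne 0$, so $\tilde{\varepsilon}_t$ is white noise as claimed. Furthermore, by construction,
$$Y_t - \mu = (1 + \tilde{\theta}L)\tilde{\varepsilon}_t,$$
so that we have found a noninvertible MA(1) representation of data that were actually generated by the invertible MA(1) representation [3.7.1].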

Either the invertible or the noninvertible representation could characterize any given data equally well, though there is a practical reason for preferring the invertible representation. To find the value of $\varepsilon$ for date $t$ associated with the invertible representation as in [3.7.8], we need to know current and past values of $Y$. By contrast, to find the value of $\tilde{\varepsilon}$ for date $t$ associated with the noninvertible representation as in [3.7.11], we need to use all of the future values of $Y$! If the intention is to calculate the current value of $\varepsilon_t$ using real-world data, it will be feasible only to work with the invertible representation. Also, as will be noted in Chapters 4 and 5, some convenient algorithms for estimating parameters and forecasting are valid only if the invertible representation is used.

The value of $\varepsilon_t$ associated with the invertible representation is sometimes called the fundamental innovation for $Y_t$. For the borderline case when $|\theta| = 1$, the process is noninvertible, but the innovation $\varepsilon_t$ for such a process will still be described as the fundamental innovation for $Y_t$.


Invertibility for the MA(q) Process 


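Consider now the MA($q$) process,
$$(Y_t - \mu) = (1 + \theta_1 L + \theta_2 L^2 + \cdots + \theta_q L^q)\varepsilon_t \qquad [3.7.12]$$
$$E(\varepsilon_t\varepsilon_\tau) = \begin{cases} \sigma^2 & \text{for } t = \tau \\ 0 & \text{otherwise.} \end{cases}$$
Provided that the roots of
$$(1 + \theta_1 z + \theta_2 z^2 + \cdots + \theta_q z^q) = 0 \qquad [3.7.13]$$
lie outside the unit circle, [3.7.12] can be written as an AR($\infty$) simply by inverting the MA operator,
$$(1 + \eta_1 L + \eta_2 L^2 + \eta_3 L^3 + \cdots)(Y_t - \mu) = \varepsilon_t,$$
where
$$(1 + \eta_1 L + \eta_2 L^2 + \eta_3 L^3 + \cdots) = (1 + \theta_1 L + \theta_2 L^2 + \cdots + \theta_q L^q)^{-1}.$$
Where this is the case, the MA($q$) representation [3.7.12] is invertible.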
Factor the moving average operator as
$$(1 + \theta_1 L + \theta_2 L^2 + \cdots + \theta_q L^q) = (1 - \lambda_1 L)(1 - \lambda_2 L)\cdots(1 - \lambda_q L). \qquad [3.7.14]$$
If $|\lambda_i| < 1$ for all $i$, then the roots of [3.7.13] are all outside the unit circle and the representation [3.7.12] is invertible. If instead some of the $\lambda_i$ are outside (but not on) the unit circle, Hansen and Sargent (1981, p. 102) suggested the following procedure for finding an invertible representation. The autocovariance-generating function of $Y_t$ can be written
$$g_Y(z) = \sigma^2\cdot\{(1 - \lambda_1 z)(1 - \lambda_2 z)\cdots(1 - \lambda_q z)\} \times \{(1 - \lambda_1 z^{-1})(1 - \lambda_2 z^{-1})\cdots(1 - \lambda_q z^{-1})\}. \qquad [3.7.15]$$


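Order the $\lambda$'s so that $(\lambda_1, \lambda_2, \ldots, \lambda_n)$ are inside the unit circle and $(\lambda_{n+1}, \lambda_{n+2}, \ldots, \lambda_q)$ are outside the unit circle. Suppose $\sigma^2$ in [3.7.15] is replaced by $\sigma^2\cdot\lambda_{n+1}^2\cdot\lambda_{n+2}^2\cdots\lambda_q^2$; since complex $\lambda_i$ appear as conjugate pairs, this is a positive real number. Suppose further that $(\lambda_{n+1}, \lambda_{n+2}, \ldots, \lambda_q)$ are replaced with their reciprocals, $(\lambda_{n+1}^{-1}, \lambda_{n+2}^{-1}, \ldots, \lambda_q^{-1})$. The resulting function would be
$$\begin{aligned}
&\sigma^2\left\{\prod_{i=n+1}^{q}\lambda_i^2\right\}\left\{\prod_{i=1}^{n}(1 - \lambda_i z)\right\}\left\{\prod_{i=n+1}^{q}(1 - \lambda_i^{-1}z)\right\} \times \left\{\prod_{i=1}^{n}(1 - \lambda_i z^{-1})\right\}\left\{\prod_{i=n+1}^{q}(1 - \lambda_i^{-1}z^{-1})\right\} \\
&\quad = \sigma^2\left\{\prod_{i=1}^{n}(1 - \lambda_i z)\right\}\left\{\prod_{i=n+1}^{q}[\lambda_i(1 - \lambda_i^{-1}z)]\right\} \times \left\{\prod_{i=1}^{n}(1 - \lambda_i z^{-1})\right\}\left\{\prod_{i=n+1}^{q}[\lambda_i(1 - \lambda_i^{-1}z^{-1})]\right\} \\
&\quad = \sigma^2\left\{\prod_{i=1}^{n}(1 - \lambda_i z)\right\}\left\{\prod_{i=n+1}^{q}(\lambda_i - z)\right\} \times \left\{\prod_{i=1}^{n}(1 - \lambda_i z^{-1})\right\}\left\{\prod_{i=n+1}^{q}(\lambda_i - z^{-1})\right\} \\
&\quad = \sigma^2\left\{\prod_{i=1}^{n}(1 - \lambda_i z)\right\}\left\{\prod_{i=n+1}^{q}(\lambda_i z^{-1} - 1)\right\} \times \left\{\prod_{i=1}^{n}(1 - \lambda_i z^{-1})\right\}\left\{\prod_{i=n+1}^{q}(\lambda_i z - 1)\right\} \\
&\quad = \sigma^2\left\{\prod_{i=1}^{q}(1 - \lambda_i z)\right\}\left\{\prod_{i=1}^{q}(1 - \lambda_i z^{-1})\right\},
\end{aligned}$$
which is identical to [3.7.15].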
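The implication is as follows. Suppose a noninvertible representation for an MA($q$) process is written in the form
$$Y_t = \mu + \prod_{i=1}^{q}(1 - \lambda_i L)\tilde{\varepsilon}_t, \qquad [3.7.16]$$
where
$$|\lambda_i| < 1 \quad \text{for } i = 1, 2, \ldots, n$$
$$|\lambda_i| > 1 \quad \text{for } i = n + 1, n + 2, \ldots, q$$
and
$$E(\tilde{\varepsilon}_t\tilde{\varepsilon}_\tau) = \begin{cases} \tilde{\sigma}^2 & \text{for } t = \tau \\ 0 & \text{otherwise.} \end{cases}$$
Then the invertible representation is given by
$$Y_t = \mu + \left\{\prod_{i=1}^{n}(1 - \lambda_i L)\right\}\left\{\prod_{i=n+1}^{q}(1 - \lambda_i^{-1}L)\right\}\varepsilon_t, \qquad [3.7.17]$$
where
$$E(\varepsilon_t\varepsilon_\tau) = \begin{cases} \tilde{\sigma}^2\lambda_{n+1}^2\lambda_{n+2}^2\cdots\lambda_q^2 & \text{for } t = \tau \\ 0 & \text{otherwise.} \end{cases}$$
Then [3.7.16] and [3.7.17] have the identical autocovariance-generating function, though only [3.7.17] satisfies the invertibility condition.

From the structure of the preceding argument, it is clear that there are a number of alternative MA($q$) representations of the data $Y_t$ associated with all the possible "flips" between $\lambda_i$ and $\lambda_i^{-1}$. Only one of these has all of the $\lambda_i$ on or inside the unit circle. The innovations associated with this representation are said to be the fundamental innovations for $Y_t$.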
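The root-flipping procedure can be sketched in a few lines. The code below is an illustration under stated assumptions rather than a definitive implementation: the function names are hypothetical, NumPy's `np.roots` and `np.poly` are used to move between coefficients and the $\lambda_i$, and roots exactly on the unit circle are simply left unchanged. The example polynomial $(1 + 3L)(1 + 0.5L)$ is an assumed test case.

```python
import numpy as np

def ma_autocov(theta, sigma2):
    """Autocovariances gamma_0..gamma_q of an MA(q) with coefficients theta."""
    c = np.concatenate(([1.0], np.asarray(theta, dtype=float)))
    return sigma2 * np.convolve(c, c[::-1])[len(c) - 1:]

def invertible_ma(theta, sigma2):
    """Return (theta_inv, sigma2_inv) with the same autocovariances and all |lambda_i| <= 1."""
    theta = np.asarray(theta, dtype=float)
    # The lambda_i in (1 - lambda_1 L)...(1 - lambda_q L) are the roots of
    # x^q + theta_1 x^(q-1) + ... + theta_q = 0.
    lam = np.roots(np.concatenate(([1.0], theta)))
    outside = np.abs(lam) > 1.0
    lam_flipped = lam.copy()
    lam_flipped[outside] = 1.0 / lam[outside]
    # Scale the innovation variance by the product of lambda_i^2 over flipped roots;
    # with conjugate pairs this equals the product of |lambda_i|^2.
    sigma2_inv = sigma2 * np.prod(np.abs(lam[outside]) ** 2)
    theta_inv = np.real(np.poly(lam_flipped))[1:]
    return theta_inv, sigma2_inv

theta, sigma2 = [3.5, 1.5], 1.0                 # (1 + 3L)(1 + 0.5L): one lambda outside
theta_inv, sigma2_inv = invertible_ma(theta, sigma2)
print(theta_inv, sigma2_inv)
print(ma_autocov(theta, sigma2), ma_autocov(theta_inv, sigma2_inv))
```

For this example the flipped operator is $(1 + \tfrac{1}{3}L)(1 + 0.5L)$ with innovation variance 9, and both parameterizations give the same autocovariances.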




APPENDIX 3.A. Convergence Results for Infinite-Order Moving 
Average Processes 


This appendix proves the statements made in the text about convergence for the MA($\infty$) process [3.3.13].

First we show that absolute summability of the moving average coefficients implies square-summability. Suppose that $\{\psi_j\}_{j=0}^{\infty}$ is absolutely summable. Then there exists an $N < \infty$ such that $|\psi_j| < 1$ for all $j \ge N$, implying $\psi_j^2 \le |\psi_j|$ for all $j \ge N$. Then
$$\sum_{j=0}^{\infty}\psi_j^2 = \sum_{j=0}^{N-1}\psi_j^2 + \sum_{j=N}^{\infty}\psi_j^2 \le \sum_{j=0}^{N-1}\psi_j^2 + \sum_{j=N}^{\infty}|\psi_j|.$$
But $\sum_{j=0}^{N-1}\psi_j^2$ is finite, since $N$ is finite, and $\sum_{j=N}^{\infty}|\psi_j|$ is finite, since $\{\psi_j\}$ is absolutely summable. Hence $\sum_{j=0}^{\infty}\psi_j^2 < \infty$, establishing that [3.3.15] implies [3.3.14].

Next we show that square-summability does not imply absolute summability. For an example of a series that is square-summable but not absolutely summable, consider $\psi_j = 1/j$ for $j = 1, 2, \ldots$. Notice that $1/j > 1/x$ for all $x > j$, meaning that
$$1/j > \int_j^{j+1}(1/x)\,dx$$
and so
$$\sum_{j=1}^{N} 1/j > \int_1^{N+1}(1/x)\,dx = \log(N + 1) - \log(1) = \log(N + 1),$$
which diverges to $\infty$ as $N \to \infty$. Hence $\{\psi_j\}_{j=1}^{\infty}$ is not absolutely summable. It is, however, square-summable, since $1/j^2 < 1/x^2$ for all $x < j$, meaning that
$$1/j^2 < \int_{j-1}^{j}(1/x^2)\,dx$$
and so
$$\sum_{j=1}^{N} 1/j^2 < 1 + \int_1^{N}(1/x^2)\,dx = 1 + (-1/x)\Big|_{x=1}^{N} = 2 - (1/N),$$
which converges to 2 as $N \to \infty$. Hence $\{\psi_j\}_{j=1}^{\infty}$ is square-summable.
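The two integral bounds can be seen at work numerically. The following sketch, with illustrative values of $N$ and a hypothetical helper name, tabulates the partial sums of $1/j$ and $1/j^2$ against $\log(N + 1)$ and $2 - 1/N$.

```python
import math

def partial_sums(n):
    """Return (sum_{j=1}^n 1/j, sum_{j=1}^n 1/j^2)."""
    s1 = sum(1.0 / j for j in range(1, n + 1))
    s2 = sum(1.0 / j ** 2 for j in range(1, n + 1))
    return s1, s2

rows = []
for n in (10, 1_000, 100_000):
    s1, s2 = partial_sums(n)
    rows.append((n, s1, math.log(n + 1), s2, 2.0 - 1.0 / n))
    print(f"N={n:>7}  sum 1/j={s1:8.4f} > log(N+1)={math.log(n + 1):8.4f}   "
          f"sum 1/j^2={s2:.6f} < 2-1/N={2.0 - 1.0 / n:.6f}")
```

The first partial sum keeps growing with $N$ while the second stays bounded, in line with the argument above.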

Next we show that square-summability of the moving average coefficients implies that the MA($\infty$) representation in [3.3.13] generates a mean square convergent random variable. First recall what is meant by convergence of a deterministic sum such as $\sum_{j=0}^{\infty} a_j$, where $\{a_j\}$ is just a sequence of numbers. One criterion for determining whether $\sum_{j=0}^{T} a_j$ converges to some finite number as $T \to \infty$ is the Cauchy criterion. The Cauchy criterion states that $\sum_{j=0}^{\infty} a_j$ converges if and only if, for any $\epsilon > 0$, there exists a suitably large integer $N$ such that, for any integer $M > N$,
$$\left|\sum_{j=0}^{M} a_j - \sum_{j=0}^{N} a_j\right| < \epsilon.$$
In words, once we have summed $N$ terms, calculating the sum out to a larger number $M$ does not change the total by any more than an arbitrarily small number $\epsilon$.

For a stochastic process such as [3.3.13], the comparable question is whether $\sum_{j=0}^{T}\psi_j\varepsilon_{t-j}$ converges in mean square to some random variable $Y_t$ as $T \to \infty$. In this case the Cauchy criterion states that $\sum_{j=0}^{\infty}\psi_j\varepsilon_{t-j}$ converges if and only if, for any $\epsilon > 0$, there exists a suitably large integer $N$ such that for any integer $M > N$
$$E\left[\sum_{j=0}^{M}\psi_j\varepsilon_{t-j} - \sum_{j=0}^{N}\psi_j\varepsilon_{t-j}\right]^2 < \epsilon. \qquad [3.A.1]$$
In words, once $N$ terms have been summed, the difference between that sum and the one obtained from summing to $M$ is a random variable whose mean and variance are both arbitrarily close to zero.


Now, the left side of [3.A.1] is simply
$$\begin{aligned}
E[\psi_M\varepsilon_{t-M} + \psi_{M-1}\varepsilon_{t-M+1} + \cdots + \psi_{N+1}\varepsilon_{t-N-1}]^2
&= (\psi_M^2 + \psi_{M-1}^2 + \cdots + \psi_{N+1}^2)\cdot\sigma^2 \\
&= \left[\sum_{j=0}^{M}\psi_j^2 - \sum_{j=0}^{N}\psi_j^2\right]\cdot\sigma^2.
\end{aligned} \qquad [3.A.2]$$
But if $\sum_{j=0}^{\infty}\psi_j^2$ converges as required by [3.3.14], then by the Cauchy criterion the right side of [3.A.2] may be made as small as desired by choice of a suitably large $N$. Thus the infinite series in [3.3.13] converges in mean square provided that [3.3.14] is satisfied.

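Finally, we show that absolute summability of the moving average coefficients implies that the process is ergodic for the mean. Write [3.3.18] as
$$\gamma_j = \sigma^2\sum_{k=0}^{\infty}\psi_{j+k}\psi_k.$$
Then
$$|\gamma_j| = \sigma^2\left|\sum_{k=0}^{\infty}\psi_{j+k}\psi_k\right|.$$
A key property of the absolute value operator is that
$$|a + b + c| \le |a| + |b| + |c|.$$
Hence
$$|\gamma_j| \le \sigma^2\sum_{k=0}^{\infty}|\psi_{j+k}\psi_k|$$
and
$$\sum_{j=0}^{\infty}|\gamma_j| \le \sigma^2\sum_{j=0}^{\infty}\sum_{k=0}^{\infty}|\psi_{j+k}\psi_k| = \sigma^2\sum_{j=0}^{\infty}\sum_{k=0}^{\infty}|\psi_{j+k}|\cdot|\psi_k| = \sigma^2\sum_{k=0}^{\infty}|\psi_k|\sum_{j=0}^{\infty}|\psi_{j+k}|.$$
But there exists an $M < \infty$ such that $\sum_{j=0}^{\infty}|\psi_j| < M$, and therefore $\sum_{j=0}^{\infty}|\psi_{j+k}| < M$ for $k = 0, 1, 2, \ldots$, meaning that
$$\sum_{j=0}^{\infty}|\gamma_j| < \sigma^2\sum_{k=0}^{\infty}|\psi_k|\cdot M < \sigma^2 M^2 < \infty.$$
Hence [3.1.15] holds and the process is ergodic for the mean.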


Chapter 3 Exercises 


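3.1. Is the following MA(2) process covariance-stationary?
$$Y_t = (1 + 2.4L + 0.8L^2)\varepsilon_t$$
$$E(\varepsilon_t\varepsilon_\tau) = \begin{cases} 1 & \text{for } t = \tau \\ 0 & \text{otherwise.} \end{cases}$$
If so, calculate its autocovariances.
3.2. Is the following AR(2) process covariance-stationary?
$$(1 - 1.1L + 0.18L^2)Y_t = \varepsilon_t$$
$$E(\varepsilon_t\varepsilon_\tau) = \begin{cases} 1 & \text{for } t = \tau \\ 0 & \text{otherwise.} \end{cases}$$
If so, calculate its autocovariances.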
3.3. A covariance-stationary AR($p$) process,
$$(1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p)(Y_t - \mu) = \varepsilon_t,$$
has an MA($\infty$) representation given by
$$(Y_t - \mu) = \psi(L)\varepsilon_t$$
with
$$\psi(L) = 1/[1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p]$$
or
$$[1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p][\psi_0 + \psi_1 L + \psi_2 L^2 + \cdots] = 1.$$
In order for this equation to be true, the implied coefficient on $L^0$ must be unity and the coefficients on $L^1, L^2, L^3, \ldots$ must be zero. Write out these conditions explicitly and show that they imply a recursive algorithm for generating the MA($\infty$) weights $\psi_0, \psi_1, \ldots$. Show that this recursion is algebraically equivalent to setting $\psi_j$ equal to the (1, 1) element of the matrix $F$ raised to the $j$th power as in equation [1.2.28].


3.4. Derive [3.4.21]. 
3.5. Verify [3.4.22]. 
3.6. Suggest a recursive algorithm for calculating the AR($\infty$) weights,
$$(1 + \eta_1 L + \eta_2 L^2 + \cdots)(Y_t - \mu) = \varepsilon_t,$$
associated with an invertible MA($q$) process,
$$(Y_t - \mu) = (1 + \theta_1 L + \theta_2 L^2 + \cdots + \theta_q L^q)\varepsilon_t.$$
Give a closed-form expression for $\eta_j$ as a function of the roots of
$$(1 + \theta_1 z + \theta_2 z^2 + \cdots + \theta_q z^q) = 0,$$
assuming that these roots are all distinct.

3.7. Repeat Exercise 3.6 for a noninvertible MA($q$) process. (HINT: Recall equation [3.7.17].)


3.8. Show that the MA(2) process in Exercise 3.1 is not invertible. Find the invertible 
representation for the process. Calculate the autocovariances of the invertible representation 
using equation [3.3.12] and verify that these are the same as obtained in Exercise 3.1. 


Chapter 3 References 


Anderson, Brian D. O., and John B. Moore. 1979. Optimal Filtering. Englewood Cliffs, N.J.: Prentice-Hall.


Hannan, E. J. 1970. Multiple Time Series. New York: Wiley. 

Hansen, Lars P., and Thomas J. Sargent. 1981. "Formulating and Estimating Dynamic Linear Rational Expectations Models," in Robert E. Lucas, Jr., and Thomas J. Sargent, eds., Rational Expectations and Econometric Practice, Vol. I. Minneapolis: University of Minnesota Press.

Rao, C. Radhakrishna. 1973. Linear Statistical Inference and Its Applications, 2d ed. New 
York: Wiley. 

Sargent, Thomas J. 1987. Macroeconomic Theory, 2d ed. Boston: Academic Press. 




Forecasting 


This chapter discusses how to forecast time series. Section 4.1 reviews the theory of forecasting and introduces the idea of a linear projection, which is a forecast formed from a linear function of past observations. Section 4.2 describes the forecasts one would use for ARMA models if an infinite number of past observations were available. These results are useful in theoretical manipulations and in understanding the formulas in Section 4.3 for approximate optimal forecasts when only a finite number of observations are available.

Section 4.4 describes how to achieve a triangular factorization and Cholesky factorization of a variance-covariance matrix. These results are used in that section to calculate exact optimal forecasts based on a finite number of observations. They will also be used in Chapter 11 to interpret vector autoregressions, in Chapter 13 to derive the Kalman filter, and in a number of other theoretical calculations and numerical methods appearing throughout the text. The triangular factorization is used to derive a formula for updating a forecast in Section 4.5 and to establish in Section 4.6 that for Gaussian processes the linear projection is better than any nonlinear forecast.

Section 4.7 analyzes what kind of process results when two different ARMA processes are added together. Section 4.8 states Wold's decomposition, which provides a basis for using an MA($\infty$) representation to characterize the linear forecast rule for any covariance-stationary process. The section also describes a popular empirical approach for finding a reasonable approximation to this representation that was developed by Box and Jenkins (1976).


4.1. Principles of Forecasting 


Forecasts Based on Conditional Expectation 


Suppose we are interested in forecasting the value of a variable $Y_{t+1}$ based on a set of variables $X_t$ observed at date $t$. For example, we might want to forecast $Y_{t+1}$ based on its $m$ most recent values. In this case, $X_t$ would consist of a constant plus $Y_t, Y_{t-1}, \ldots,$ and $Y_{t-m+1}$.

Let $Y^*_{t+1|t}$ denote a forecast of $Y_{t+1}$ based on $X_t$. To evaluate the usefulness of this forecast, we need to specify a loss function, or a summary of how concerned we are if our forecast is off by a particular amount. Very convenient results are obtained from assuming a quadratic loss function. A quadratic loss function means choosing the forecast $Y^*_{t+1|t}$ so as to minimize
$$E(Y_{t+1} - Y^*_{t+1|t})^2. \qquad [4.1.1]$$
Expression [4.1.1] is known as the mean squared error associated with the forecast $Y^*_{t+1|t}$, denoted
$$\text{MSE}(Y^*_{t+1|t}) \equiv E(Y_{t+1} - Y^*_{t+1|t})^2.$$
The forecast with the smallest mean squared error turns out to be the expectation of $Y_{t+1}$ conditional on $X_t$:
$$Y^*_{t+1|t} = E(Y_{t+1}|X_t). \qquad [4.1.2]$$


To verify this claim, consider basing $Y^*_{t+1|t}$ on any function $g(X_t)$ other than the conditional expectation,
$$Y^*_{t+1|t} = g(X_t). \qquad [4.1.3]$$
For this candidate forecasting rule, the MSE would be
$$\begin{aligned}
E[Y_{t+1} - g(X_t)]^2 &= E[Y_{t+1} - E(Y_{t+1}|X_t) + E(Y_{t+1}|X_t) - g(X_t)]^2 \\
&= E[Y_{t+1} - E(Y_{t+1}|X_t)]^2 \\
&\quad + 2E\{[Y_{t+1} - E(Y_{t+1}|X_t)][E(Y_{t+1}|X_t) - g(X_t)]\} \\
&\quad + E\{[E(Y_{t+1}|X_t) - g(X_t)]^2\}.
\end{aligned} \qquad [4.1.4]$$
Write the middle term on the right side of [4.1.4] as
$$2E[\eta_{t+1}], \qquad [4.1.5]$$
where
$$\eta_{t+1} \equiv \{[Y_{t+1} - E(Y_{t+1}|X_t)][E(Y_{t+1}|X_t) - g(X_t)]\}.$$
Consider first the expectation of $\eta_{t+1}$ conditional on $X_t$. Conditional on $X_t$, the terms $E(Y_{t+1}|X_t)$ and $g(X_t)$ are known constants and can be factored out of this expectation:¹
$$\begin{aligned}
E[\eta_{t+1}|X_t] &= [E(Y_{t+1}|X_t) - g(X_t)] \times E([Y_{t+1} - E(Y_{t+1}|X_t)]\,|\,X_t) \\
&= [E(Y_{t+1}|X_t) - g(X_t)] \times 0 \\
&= 0.
\end{aligned}$$
By a straightforward application of the law of iterated expectations, equation [A.5.10], it follows that
$$E[\eta_{t+1}] = E_{X_t}(E[\eta_{t+1}|X_t]) = 0.$$
Substituting this back into [4.1.4] gives
$$E[Y_{t+1} - g(X_t)]^2 = E[Y_{t+1} - E(Y_{t+1}|X_t)]^2 + E([E(Y_{t+1}|X_t) - g(X_t)]^2). \qquad [4.1.6]$$
The second term on the right side of [4.1.6] cannot be made smaller than zero, and the first term does not depend on $g(X_t)$. The function $g(X_t)$ that makes the mean squared error [4.1.6] as small as possible is the function that sets the second term in [4.1.6] to zero:
$$E(Y_{t+1}|X_t) = g(X_t). \qquad [4.1.7]$$
Thus the forecast $g(X_t)$ that minimizes the mean squared error is the conditional expectation $E(Y_{t+1}|X_t)$, as claimed.
The MSE of this optimal forecast is
$$E[Y_{t+1} - g(X_t)]^2 = E[Y_{t+1} - E(Y_{t+1}|X_t)]^2. \qquad [4.1.8]$$


¹The conditional expectation $E(Y_{t+1}|X_t)$ represents the conditional population moment of the random variable $Y_{t+1}$ and is not a function of the random variable $Y_{t+1}$ itself. For example, if $Y_{t+1}|X_t \sim N(\alpha'X_t, \Omega)$, then $E(Y_{t+1}|X_t) = \alpha'X_t$, which does not depend on $Y_{t+1}$.



Forecasts Based on Linear Projection 


We now restrict the class of forecasts considered by requiring the forecast $Y^*_{t+1|t}$ to be a linear function of $X_t$:
$$Y^*_{t+1|t} = \alpha'X_t. \qquad [4.1.9]$$
Suppose we were to find a value for $\alpha$ such that the forecast error $(Y_{t+1} - \alpha'X_t)$ is uncorrelated with $X_t$:
$$E[(Y_{t+1} - \alpha'X_t)X_t'] = 0'. \qquad [4.1.10]$$
If [4.1.10] holds, then the forecast $\alpha'X_t$ is called the linear projection of $Y_{t+1}$ on $X_t$.

The linear projection turns out to produce the smallest mean squared error among the class of linear forecasting rules. The proof of this claim closely parallels the demonstration of the optimality of the conditional expectation among the set of all possible forecasts. Let $g'X_t$ denote any arbitrary linear forecasting rule. Note that its MSE is
$$\begin{aligned}
E[Y_{t+1} - g'X_t]^2 &= E[Y_{t+1} - \alpha'X_t + \alpha'X_t - g'X_t]^2 \\
&= E[Y_{t+1} - \alpha'X_t]^2 + 2E\{[Y_{t+1} - \alpha'X_t][\alpha'X_t - g'X_t]\} \\
&\quad + E[\alpha'X_t - g'X_t]^2.
\end{aligned} \qquad [4.1.11]$$
As in the case of [4.1.4], the middle term on the right side of [4.1.11] is zero:
$$E([Y_{t+1} - \alpha'X_t][\alpha'X_t - g'X_t]) = (E[Y_{t+1} - \alpha'X_t]X_t')[\alpha - g] = 0'[\alpha - g],$$
by virtue of [4.1.10]. Thus [4.1.11] simplifies to
$$E[Y_{t+1} - g'X_t]^2 = E[Y_{t+1} - \alpha'X_t]^2 + E[\alpha'X_t - g'X_t]^2. \qquad [4.1.12]$$
The optimal linear forecast $g'X_t$ is the value that sets the second term in [4.1.12] equal to zero:
$$g'X_t = \alpha'X_t,$$
where $\alpha'X_t$ satisfies [4.1.10].

For $\alpha'X_t$ satisfying [4.1.10], we will use the notation
$$\hat{P}(Y_{t+1}|X_t) = \alpha'X_t,$$
or sometimes simply
$$\hat{Y}_{t+1|t} = \alpha'X_t,$$
to indicate the linear projection of $Y_{t+1}$ on $X_t$. Notice that
$$\text{MSE}[\hat{P}(Y_{t+1}|X_t)] \ge \text{MSE}[E(Y_{t+1}|X_t)],$$
since the conditional expectation offers the best possible forecast.
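The gap between the two mean squared errors can be made concrete with a simulated example. Everything in this sketch is an assumption chosen for illustration: a scalar standard normal regressor $X$, the data-generating process $Y = X^2 + u$ with independent standard normal $u$, and the regressor vector $(1, X)$. Under these assumptions $E(Y|X) = X^2$, while the population moments give a linear projection equal to the constant 1.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000
x = rng.normal(size=n)
y = x ** 2 + rng.normal(size=n)          # assumed DGP: E(Y | X) = X^2

# Population moments for the regressor vector (1, X) under this DGP:
# E[(1, X)'(1, X)] = I and E[Y (1, X)] = (1, 0).
exx = np.eye(2)
eyx = np.array([1.0, 0.0])
alpha = np.linalg.solve(exx, eyx)        # [4.1.13]; exx is symmetric

mse_conditional = np.mean((y - x ** 2) ** 2)
mse_projection = np.mean((y - (alpha[0] + alpha[1] * x)) ** 2)
print(mse_conditional, mse_projection)
```

In this design the conditional expectation attains an MSE near 1 (the variance of $u$), while the best linear rule has an MSE near 3, since it cannot exploit the quadratic dependence on $X$.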

For most applications a constant term will be included in the projection. We will use the symbol $\hat{E}$ to indicate a linear projection on a vector of random variables $X_t$ along with a constant term:
$$\hat{E}(Y_{t+1}|X_t) \equiv \hat{P}(Y_{t+1}|1, X_t).$$


Properties of Linear Projection 


It is straightforward to use [4.1.10] to calculate the projection coefficient $\alpha$ in terms of the moments of $Y_{t+1}$ and $X_t$:
$$E(Y_{t+1}X_t') = \alpha'E(X_tX_t'),$$
or
$$\alpha' = E(Y_{t+1}X_t')[E(X_tX_t')]^{-1}, \qquad [4.1.13]$$
assuming that $E(X_tX_t')$ is a nonsingular matrix. When $E(X_tX_t')$ is singular, the coefficient vector $\alpha$ is not uniquely determined by [4.1.10], though the product of this vector with the explanatory variables, $\alpha'X_t$, is uniquely determined by [4.1.10].²
The MSE associated with a linear projection is given by
$$E(Y_{t+1} - \alpha'X_t)^2 = E(Y_{t+1})^2 - 2E(\alpha'X_tY_{t+1}) + E(\alpha'X_tX_t'\alpha). \qquad [4.1.14]$$
Substituting [4.1.13] into [4.1.14] produces
$$\begin{aligned}
E(Y_{t+1} - \alpha'X_t)^2 &= E(Y_{t+1})^2 - 2E(Y_{t+1}X_t')[E(X_tX_t')]^{-1}E(X_tY_{t+1}) \\
&\quad + E(Y_{t+1}X_t')[E(X_tX_t')]^{-1} \times E(X_tX_t')[E(X_tX_t')]^{-1}E(X_tY_{t+1}) \\
&= E(Y_{t+1})^2 - E(Y_{t+1}X_t')[E(X_tX_t')]^{-1}E(X_tY_{t+1}).
\end{aligned} \qquad [4.1.15]$$


Notice that if $X_t$ includes a constant term, then the projection of $(aY_{t+1} + b)$ on $X_t$ (where $a$ and $b$ are deterministic constants) is equal to
$$\hat{P}[(aY_{t+1} + b)|X_t] = a\cdot\hat{P}(Y_{t+1}|X_t) + b.$$
To see this, observe that $a\cdot\hat{P}(Y_{t+1}|X_t) + b$ is a linear function of $X_t$. Moreover, the forecast error,
$$(aY_{t+1} + b) - [a\cdot\hat{P}(Y_{t+1}|X_t) + b] = a[Y_{t+1} - \hat{P}(Y_{t+1}|X_t)],$$
is uncorrelated with $X_t$, as required of a linear projection.


Linear Projection and Ordinary Least Squares Regression 


Linear projection is closely related to ordinary least squares regression. This 
subsection discusses the relationship between the two concepts. 
A linear regression model relates an observation on $y_{t+1}$ to $x_t$:
$$y_{t+1} = \beta'x_t + u_t. \qquad [4.1.16]$$
Given a sample of $T$ observations on $y$ and $x$, the sample sum of squared residuals is defined as
$$\sum_{t=1}^{T}(y_{t+1} - \beta'x_t)^2. \qquad [4.1.17]$$
The value of $\beta$ that minimizes [4.1.17], denoted $b$, is the ordinary least squares (OLS) estimate of $\beta$. The formula for $b$ turns out to be
$$b = \left[\sum_{t=1}^{T}x_tx_t'\right]^{-1}\left[\sum_{t=1}^{T}x_ty_{t+1}\right], \qquad [4.1.18]$$


²If $E(X_tX_t')$ is singular, there exists a nonzero vector $c$ such that $c'\cdot E(X_tX_t')\cdot c = E(c'X_t)^2 = 0$, so that some linear combination $c'X_t$ is equal to zero for all realizations. For example, if $X_t$ consists of two random variables, the second variable must be a rescaled version of the first: $X_{2t} = c\cdot X_{1t}$. One could simply drop the redundant variables from such a system and calculate the linear projection of $Y_{t+1}$ on $X_t^*$, where $X_t^*$ is a vector consisting of the nonredundant elements of $X_t$. This linear projection $\alpha^{*\prime}X_t^*$ can be uniquely calculated from [4.1.13] with $X_t$ in [4.1.13] replaced by $X_t^*$. Any linear combination of the original variables $\alpha'X_t$ satisfying [4.1.10] represents this same random variable; that is, $\alpha'X_t = \alpha^{*\prime}X_t^*$ for all values of $\alpha$ consistent with [4.1.10].



which equivalently can be written
$$b = \left[(1/T)\sum_{t=1}^{T}x_tx_t'\right]^{-1}\left[(1/T)\sum_{t=1}^{T}x_ty_{t+1}\right]. \qquad [4.1.19]$$
Comparing the OLS coefficient estimate $b$ in equation [4.1.19] with the linear projection coefficient $\alpha$ in equation [4.1.13], we see that $b$ is constructed from the sample moments $(1/T)\sum_{t=1}^{T}x_tx_t'$ and $(1/T)\sum_{t=1}^{T}x_ty_{t+1}$, while $\alpha$ is constructed from population moments $E(X_tX_t')$ and $E(X_tY_{t+1})$. Thus OLS regression is a summary of the particular sample observations $(x_1, x_2, \ldots, x_T)$ and $(y_2, y_3, \ldots, y_{T+1})$, whereas linear projection is a summary of the population characteristics of the stochastic process $\{X_t, Y_{t+1}\}_{t=-\infty}^{\infty}$.

Although linear projection describes population moments and ordinary least squares describes sample moments, there is a formal mathematical sense in which the two operations are the same. Appendix 4.A to this chapter discusses this parallel and shows how the formulas for an OLS regression can be viewed as a special case of the formulas for a linear projection.

Notice that if the stochastic process $\{X_t, Y_{t+1}\}$ is covariance-stationary and ergodic for second moments, then the sample moments will converge to the population moments as the sample size $T$ goes to infinity:
$$(1/T)\sum_{t=1}^{T}X_tX_t' \xrightarrow{p} E(X_tX_t')$$
$$(1/T)\sum_{t=1}^{T}X_tY_{t+1} \xrightarrow{p} E(X_tY_{t+1}),$$
implying
$$b \xrightarrow{p} \alpha. \qquad [4.1.20]$$


Thus OLS regression of $y_{t+1}$ on $x_t$ yields a consistent estimate of the linear projection coefficient. Note that this result requires only that the process be ergodic for second moments. By contrast, structural econometric analysis requires much stronger assumptions about the relation between $X$ and $Y$. The difference arises because structural analysis seeks the effect of $X$ on $Y$. In structural analysis, changes in $X$ are associated with a particular structural event such as a change in Federal Reserve policy, and the objective is to evaluate the consequences for $Y$. Where that is the objective, it is very important to consider the nature of the correlation between $X$ and $Y$ before relying on OLS estimates. In the case of linear projection, however, the only concern is forecasting, for which it does not matter whether it is $X$ that causes $Y$ or $Y$ that causes $X$. Their observed historical comovements (as summarized by $E(X_tY_{t+1})$) are all that is needed for calculating a forecast. Result [4.1.20] shows that ordinary least squares regression provides a sound basis for forecasting under very mild assumptions.
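Result [4.1.20] can be illustrated by simulation. The setup below is an assumed example, not taken from the text: a Gaussian MA(1) with $\mu = 0$, $\sigma^2 = 1$, and $\theta = 0.5$, forecast from its own most recent value with no constant. For this process the population projection coefficient of $Y_{t+1}$ on $Y_t$ is $\gamma_1/\gamma_0 = \theta/(1 + \theta^2)$, which is not the structural parameter $\theta$ but is what OLS estimates.

```python
import numpy as np

rng = np.random.default_rng(2)
T, theta = 200_000, 0.5
eps = rng.normal(size=T + 2)
y = eps[1:] + theta * eps[:-1]            # MA(1) with mu = 0, sigma^2 = 1; length T + 1

x_t, y_next = y[:-1], y[1:]               # pairs (x_t, y_{t+1}) with x_t = y_t

alpha = theta / (1.0 + theta ** 2)        # population value: E(Y_{t+1} Y_t) / E(Y_t^2)
b = np.sum(x_t * y_next) / np.sum(x_t ** 2)   # [4.1.18] with a scalar regressor
print(alpha, b)
```

The OLS estimate settles near 0.4 rather than 0.5, which is the right coefficient for forecasting $Y_{t+1}$ linearly from $Y_t$ even though it has no structural interpretation.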

One possible violation of these assumptions should nevertheless be noted. 
Result [4.1.20] was derived by assuming a covariance-stationary, ergodic process. 
However, the moments of the data may have changed over time in fundamental 
ways, or the future environment may be different from that in the past. Where 
this is the case, ordinary least squares may be undesirable, and better forecasts 
can emerge from careful structural analysis. 




Forecasting Vectors 


The preceding results can be extended to forecast an $(n \times 1)$ vector $Y_{t+1}$ on the basis of a linear function of an $(m \times 1)$ vector $X_t$:
$$\hat{P}(Y_{t+1}|X_t) = \alpha'X_t \equiv \hat{Y}_{t+1|t}. \qquad [4.1.21]$$
Then $\alpha'$ would denote an $(n \times m)$ matrix of projection coefficients satisfying
$$E[(Y_{t+1} - \alpha'X_t)X_t'] = 0; \qquad [4.1.22]$$
that is, each of the $n$ elements of $(Y_{t+1} - \hat{Y}_{t+1|t})$ is uncorrelated with each of the $m$ elements of $X_t$. Accordingly, the $j$th element of the vector $\hat{Y}_{t+1|t}$ gives the minimum MSE forecast of the scalar $Y_{j,t+1}$. Moreover, to forecast any linear combination of the elements of $Y_{t+1}$, say, $z_{t+1} = h'Y_{t+1}$, the minimum MSE forecast of $z_{t+1}$ requires $(z_{t+1} - \hat{z}_{t+1|t})$ to be uncorrelated with $X_t$. But since each of the elements of $(Y_{t+1} - \hat{Y}_{t+1|t})$ is uncorrelated with $X_t$, clearly $h'(Y_{t+1} - \hat{Y}_{t+1|t})$ is also uncorrelated with $X_t$. Thus when $\hat{Y}_{t+1|t}$ satisfies [4.1.22], then $h'\hat{Y}_{t+1|t}$ is the minimum MSE forecast of $h'Y_{t+1}$ for any value of $h$.
From [4.1.22], the matrix of projection coefficients is given by
$$\alpha' = [E(Y_{t+1}X_t')]\cdot[E(X_tX_t')]^{-1}. \qquad [4.1.23]$$
The matrix generalization of the formula for the mean squared error [4.1.15] is
$$\begin{aligned}
\text{MSE}(\alpha'X_t) &\equiv E\{[Y_{t+1} - \alpha'X_t]\cdot[Y_{t+1} - \alpha'X_t]'\} \\
&= E(Y_{t+1}Y_{t+1}') - [E(Y_{t+1}X_t')]\cdot[E(X_tX_t')]^{-1}\cdot[E(X_tY_{t+1}')].
\end{aligned} \qquad [4.1.24]$$
4.2. Forecasts Based on an Infinite Number 
of Observations 
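Forecasting Based on Lagged ε's
Consider a process with an MA($\infty$) representation
$$(Y_t - \mu) = \psi(L)\varepsilon_t \qquad [4.2.1]$$
with $\varepsilon_t$ white noise and
$$\psi(L) \equiv \sum_{j=0}^{\infty}\psi_jL^j$$
$$\psi_0 = 1$$
$$\sum_{j=0}^{\infty}|\psi_j| < \infty. \qquad [4.2.2]$$
Suppose that we have an infinite number of observations on $\varepsilon$ through date $t$, $\{\varepsilon_t, \varepsilon_{t-1}, \varepsilon_{t-2}, \ldots\}$, and further know the values of $\mu$ and $\{\psi_1, \psi_2, \ldots\}$. Say we want to forecast the value of $Y_{t+s}$, that is, the value that $Y$ will take on $s$ periods from now. Note that [4.2.1] implies
$$Y_{t+s} = \mu + \varepsilon_{t+s} + \psi_1\varepsilon_{t+s-1} + \cdots + \psi_{s-1}\varepsilon_{t+1} + \psi_s\varepsilon_t + \psi_{s+1}\varepsilon_{t-1} + \cdots. \qquad [4.2.3]$$
The optimal linear forecast takes the form
$$\hat{E}[Y_{t+s}|\varepsilon_t, \varepsilon_{t-1}, \ldots] = \mu + \psi_s\varepsilon_t + \psi_{s+1}\varepsilon_{t-1} + \psi_{s+2}\varepsilon_{t-2} + \cdots. \qquad [4.2.4]$$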




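That is, the unknown future ε's are set to their expected value of zero. The error
associated with this forecast is

$$Y_{t+s} - \hat{E}[Y_{t+s} \mid \varepsilon_t, \varepsilon_{t-1}, \ldots] = \varepsilon_{t+s} + \psi_1\varepsilon_{t+s-1} + \cdots + \psi_{s-1}\varepsilon_{t+1}.$$   [4.2.5]

In order for [4.2.4] to be the optimal linear forecast, condition [4.1.10] requires
the forecast error to have mean zero and to be uncorrelated with $\varepsilon_t, \varepsilon_{t-1}, \ldots$.
It is readily confirmed that the error in [4.2.5] has these properties, so [4.2.4]
must indeed be the linear projection, as claimed. The mean squared error associated
with this forecast is

$$E\left(Y_{t+s} - \hat{E}[Y_{t+s} \mid \varepsilon_t, \varepsilon_{t-1}, \ldots]\right)^2 = (1 + \psi_1^2 + \psi_2^2 + \cdots + \psi_{s-1}^2)\sigma^2.$$   [4.2.6]

For example, for an MA($q$) process,

$$\psi(L) = 1 + \theta_1 L + \theta_2 L^2 + \cdots + \theta_q L^q,$$

the optimal linear forecast is

$$\hat{E}[Y_{t+s} \mid \varepsilon_t, \varepsilon_{t-1}, \ldots] =
\begin{cases}
\mu + \theta_s\varepsilon_t + \theta_{s+1}\varepsilon_{t-1} + \cdots + \theta_q\varepsilon_{t-q+s} & \text{for } s = 1, 2, \ldots, q \\
\mu & \text{for } s = q+1, q+2, \ldots.
\end{cases}$$   [4.2.7]

The MSE is

$$\begin{cases}
\sigma^2 & \text{for } s = 1 \\
(1 + \theta_1^2 + \theta_2^2 + \cdots + \theta_{s-1}^2)\sigma^2 & \text{for } s = 2, 3, \ldots, q \\
(1 + \theta_1^2 + \theta_2^2 + \cdots + \theta_q^2)\sigma^2 & \text{for } s = q+1, q+2, \ldots.
\end{cases}$$

The MSE increases with the forecast horizon $s$ up until $s = q$. If we try to
forecast an MA($q$) farther than $q$ periods into the future, the forecast is simply the
unconditional mean of the series ($E(Y_t) = \mu$) and the MSE is the unconditional
variance of the series ($\text{Var}(Y_t) = (1 + \theta_1^2 + \theta_2^2 + \cdots + \theta_q^2)\sigma^2$).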
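
The forecast [4.2.7] and the MSE [4.2.6] are simple enough to compute directly. The sketch below assumes the shocks are observed and supplied most recent first; the function names and the MA(2) parameter values are hypothetical.

```python
def forecast_mse(psi, sigma2, s):
    """MSE of the s-period-ahead forecast, [4.2.6].

    psi holds [psi_1, psi_2, ...]; psi_0 = 1 is implied.
    """
    weights = [1.0] + list(psi[: s - 1])
    return sigma2 * sum(w * w for w in weights)

def ma_forecast(mu, theta, eps, s):
    """Optimal forecast of an MA(q) from observed shocks, [4.2.7].

    eps holds [eps_t, eps_{t-1}, ...], most recent first.
    """
    q = len(theta)
    if s > q:
        return mu  # beyond q periods the forecast is the unconditional mean
    # theta_s eps_t + theta_{s+1} eps_{t-1} + ... + theta_q eps_{t-q+s}
    return mu + sum(theta[j - 1] * eps[j - s] for j in range(s, q + 1))

theta = [0.5, -0.3]   # hypothetical MA(2) coefficients
sigma2 = 2.0          # hypothetical innovation variance
mses = [forecast_mse(theta, sigma2, s) for s in range(1, 5)]
```

For an MA($q$) the $\psi$ weights are the $\theta$ coefficients, so `mses` rises until $s = q + 1$ and is flat thereafter, in line with the discussion above.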

These properties also characterize the MA(∞) case as the forecast horizon $s$
goes to infinity. It is straightforward to establish from [4.2.2] that as $s \to \infty$, the
forecast in [4.2.4] converges in mean square to $\mu$, the unconditional mean. The
MSE [4.2.6] likewise converges to $\sigma^2 \sum_{j=0}^{\infty} \psi_j^2$, which is the unconditional variance
of the MA(∞) process [4.2.1].

A compact lag operator expression for the forecast in [4.2.4] is sometimes
used. Consider taking the polynomial $\psi(L)$ and dividing by $L^s$:

$$\frac{\psi(L)}{L^s} = L^{-s} + \psi_1 L^{1-s} + \psi_2 L^{2-s} + \cdots + \psi_{s-1}L^{-1} + \psi_s L^0 + \psi_{s+1}L^1 + \psi_{s+2}L^2 + \cdots.$$

The annihilation operator³ (indicated by $[\cdot]_+$) replaces negative powers of $L$ by
zero; for example,

$$\left[\frac{\psi(L)}{L^s}\right]_+ = \psi_s + \psi_{s+1}L + \psi_{s+2}L^2 + \cdots.$$   [4.2.8]

Comparing [4.2.8] with [4.2.4], the optimal forecast could be written in lag operator
notation as

$$\hat{E}[Y_{t+s} \mid \varepsilon_t, \varepsilon_{t-1}, \ldots] = \mu + \left[\frac{\psi(L)}{L^s}\right]_+ \varepsilon_t.$$   [4.2.9]


³ This discussion of forecasting based on the annihilation operator is similar to that in Sargent (1987).




Forecasting Based on Lagged Y's

The previous forecasts were based on the assumption that $\varepsilon_t$ is observed
directly. In the usual forecasting situation, we actually have observations on lagged
Y's, not lagged ε's. Suppose that the process [4.2.1] has an AR(∞) representation
given by

$$\eta(L)(Y_t - \mu) = \varepsilon_t,$$   [4.2.10]

where $\eta(L) \equiv \sum_{j=0}^{\infty} \eta_j L^j$, $\eta_0 = 1$, and $\sum_{j=0}^{\infty} |\eta_j| < \infty$. Suppose further that the AR
polynomial $\eta(L)$ and the MA polynomial $\psi(L)$ are related by

$$\eta(L) = [\psi(L)]^{-1}.$$   [4.2.11]

A covariance-stationary AR($p$) model of the form

$$(1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p)(Y_t - \mu) = \varepsilon_t,$$   [4.2.12]

or, more compactly,

$$\phi(L)(Y_t - \mu) = \varepsilon_t,$$

clearly satisfies these requirements, with $\eta(L) = \phi(L)$ and $\psi(L) = [\phi(L)]^{-1}$. An
MA($q$) process

$$Y_t - \mu = (1 + \theta_1 L + \theta_2 L^2 + \cdots + \theta_q L^q)\varepsilon_t$$   [4.2.13]

or

$$Y_t - \mu = \theta(L)\varepsilon_t$$

is also of this form, with $\psi(L) = \theta(L)$ and $\eta(L) = [\theta(L)]^{-1}$, provided that [4.2.13]
is based on the invertible representation. With a noninvertible MA($q$), the roots
must first be flipped as described in Section 3.7 before applying the formulas given
in this section. An ARMA($p$, $q$) also satisfies [4.2.10] and [4.2.11] with $\psi(L) =
\theta(L)/\phi(L)$, provided that the autoregressive operator $\phi(L)$ satisfies the stationarity
condition (roots of $\phi(z) = 0$ lie outside the unit circle) and that the moving average
operator $\theta(L)$ satisfies the invertibility condition (roots of $\theta(z) = 0$ lie outside the
unit circle).

Where the restrictions associated with [4.2.10] and [4.2.11] are satisfied,
observations on $\{Y_t, Y_{t-1}, \ldots\}$ will be sufficient to construct $\{\varepsilon_t, \varepsilon_{t-1}, \ldots\}$. For
example, for an AR(1) process [4.2.10] would be

$$(1 - \phi L)(Y_t - \mu) = \varepsilon_t.$$   [4.2.14]

Thus, given $\phi$ and $\mu$ and observation of $Y_t$ and $Y_{t-1}$, the value of $\varepsilon_t$ can be
constructed from

$$\varepsilon_t = (Y_t - \mu) - \phi(Y_{t-1} - \mu).$$

For an MA(1) process written in invertible form, [4.2.10] would be

$$(1 + \theta L)^{-1}(Y_t - \mu) = \varepsilon_t.$$

Given an infinite number of observations on $Y$, we could construct $\varepsilon$ from

$$\varepsilon_t = (Y_t - \mu) - \theta(Y_{t-1} - \mu) + \theta^2(Y_{t-2} - \mu) - \theta^3(Y_{t-3} - \mu) + \cdots.$$

Under these conditions, [4.2.10] can be substituted into [4.2.9] to obtain the
forecast of $Y_{t+s}$ as a function of lagged Y's:

$$\hat{E}[Y_{t+s} \mid Y_t, Y_{t-1}, \ldots] = \mu + \left[\frac{\psi(L)}{L^s}\right]_+ \eta(L)(Y_t - \mu);$$   [4.2.15]

or, using [4.2.11],

$$\hat{E}[Y_{t+s} \mid Y_t, Y_{t-1}, \ldots] = \mu + \left[\frac{\psi(L)}{L^s}\right]_+ \frac{1}{\psi(L)}(Y_t - \mu).$$   [4.2.16]

Equation [4.2.16] is known as the Wiener-Kolmogorov prediction formula. Several
examples of using this forecasting rule follow.


Forecasting an AR(1) Process 


For the covariance-stationary AR(1) process [4.2.14], we have

$$\psi(L) = \frac{1}{1 - \phi L} = 1 + \phi L + \phi^2 L^2 + \phi^3 L^3 + \cdots$$   [4.2.17]

and

$$\left[\frac{\psi(L)}{L^s}\right]_+ = \phi^s + \phi^{s+1}L + \phi^{s+2}L^2 + \cdots = \frac{\phi^s}{1 - \phi L}.$$   [4.2.18]

Substituting [4.2.18] into [4.2.16] yields the optimal linear $s$-period-ahead forecast
for a stationary AR(1) process:

$$\hat{E}[Y_{t+s} \mid Y_t, Y_{t-1}, \ldots] = \mu + \frac{\phi^s}{1 - \phi L}(1 - \phi L)(Y_t - \mu) = \mu + \phi^s(Y_t - \mu).$$   [4.2.19]

The forecast decays geometrically from $(Y_t - \mu)$ toward $\mu$ as the forecast horizon
$s$ increases. From [4.2.17], the moving average weight $\psi_j$ is given by $\phi^j$, so from
[4.2.6], the mean squared $s$-period-ahead forecast error is

$$[1 + \phi^2 + \phi^4 + \cdots + \phi^{2(s-1)}]\sigma^2.$$

Notice that this grows with $s$ and asymptotically approaches $\sigma^2/(1 - \phi^2)$, the
unconditional variance of $Y$.
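
A minimal sketch of the AR(1) forecast and its MSE follows. The function names and the parameter values are hypothetical and serve only to illustrate the geometric decay and the limiting variance.

```python
def ar1_forecast(mu, phi, y_t, s):
    """s-period-ahead forecast for a stationary AR(1): mu + phi^s (Y_t - mu)."""
    return mu + phi ** s * (y_t - mu)

def ar1_forecast_mse(phi, sigma2, s):
    """[1 + phi^2 + phi^4 + ... + phi^(2(s-1))] sigma^2."""
    return sigma2 * sum(phi ** (2 * j) for j in range(s))

# Hypothetical values: mu = 1, phi = 0.5, sigma^2 = 1, last observation Y_t = 3.
two_step = ar1_forecast(1.0, 0.5, 3.0, 2)        # 1 + 0.25 * 2
two_step_mse = ar1_forecast_mse(0.5, 1.0, 2)     # 1 + 0.25
long_run_mse = ar1_forecast_mse(0.5, 1.0, 200)   # close to 1 / (1 - 0.25)
```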


Forecasting an AR(p) Process 


Next consider forecasting the stationary AR($p$) process [4.2.12]. The Wiener-
Kolmogorov formula in [4.2.16] essentially expresses the value of $(Y_{t+s} - \mu)$ in
terms of initial values $\{(Y_t - \mu), (Y_{t-1} - \mu), \ldots\}$ and subsequent values of
$\{\varepsilon_{t+1}, \varepsilon_{t+2}, \ldots, \varepsilon_{t+s}\}$ and then drops the terms involving future ε's. An expression of
this form was provided by equation [1.2.26], which described the value of a variable
subject to a $p$th-order difference equation in terms of initial conditions and
subsequent shocks:

$$Y_{t+s} - \mu = f_{11}^{(s)}(Y_t - \mu) + f_{12}^{(s)}(Y_{t-1} - \mu) + \cdots + f_{1p}^{(s)}(Y_{t-p+1} - \mu)
+ \varepsilon_{t+s} + \psi_1\varepsilon_{t+s-1} + \psi_2\varepsilon_{t+s-2} + \cdots + \psi_{s-1}\varepsilon_{t+1},$$   [4.2.20]

where

$$\psi_j = f_{11}^{(j)}.$$   [4.2.21]

Recall that $f_{11}^{(j)}$ denotes the (1, 1) element of $\mathbf{F}^j$, $f_{12}^{(j)}$ denotes the (1, 2) element of
$\mathbf{F}^j$, and so on, where $\mathbf{F}$ is the following $(p \times p)$ matrix:

$$\mathbf{F} \equiv \begin{bmatrix}
\phi_1 & \phi_2 & \phi_3 & \cdots & \phi_{p-1} & \phi_p \\
1 & 0 & 0 & \cdots & 0 & 0 \\
0 & 1 & 0 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
0 & 0 & 0 & \cdots & 1 & 0
\end{bmatrix}.$$

The optimal $s$-period-ahead forecast is thus

$$\hat{Y}_{t+s|t} = \mu + f_{11}^{(s)}(Y_t - \mu) + f_{12}^{(s)}(Y_{t-1} - \mu) + \cdots + f_{1p}^{(s)}(Y_{t-p+1} - \mu).$$   [4.2.22]

Notice that for any forecast horizon $s$ the optimal forecast is a constant plus a linear
function of $\{Y_t, Y_{t-1}, \ldots, Y_{t-p+1}\}$. The associated forecast error is

$$Y_{t+s} - \hat{Y}_{t+s|t} = \varepsilon_{t+s} + \psi_1\varepsilon_{t+s-1} + \psi_2\varepsilon_{t+s-2} + \cdots + \psi_{s-1}\varepsilon_{t+1}.$$   [4.2.23]

The easiest way to calculate the forecast in [4.2.22] is through a simple
recursion. This recursion can be deduced independently from a principle known as
the law of iterated projections, which will be proved formally in Section 4.5. Suppose
that at date $t$ we wanted to make a one-period-ahead forecast of $Y_{t+1}$. The optimal
forecast is clearly

$$(\hat{Y}_{t+1|t} - \mu) = \phi_1(Y_t - \mu) + \phi_2(Y_{t-1} - \mu) + \cdots + \phi_p(Y_{t-p+1} - \mu).$$   [4.2.24]

Consider next a two-period-ahead forecast. Suppose that at date $t + 1$ we were
to make a one-period-ahead forecast of $Y_{t+2}$. Replacing $t$ with $t + 1$ in [4.2.24]
gives the optimal forecast as

$$(\hat{Y}_{t+2|t+1} - \mu) = \phi_1(Y_{t+1} - \mu) + \phi_2(Y_t - \mu) + \cdots + \phi_p(Y_{t-p+2} - \mu).$$   [4.2.25]

The law of iterated projections asserts that if this date $t + 1$ forecast of $Y_{t+2}$ is
projected on date $t$ information, the result is the date $t$ forecast of $Y_{t+2}$. At date
$t$ the values $Y_t, Y_{t-1}, \ldots, Y_{t-p+2}$ in [4.2.25] are known. Thus,

$$(\hat{Y}_{t+2|t} - \mu) = \phi_1(\hat{Y}_{t+1|t} - \mu) + \phi_2(Y_t - \mu) + \cdots + \phi_p(Y_{t-p+2} - \mu).$$   [4.2.26]

Substituting [4.2.24] into [4.2.26] then yields the two-period-ahead forecast for an
AR($p$) process:

$$\begin{aligned}
(\hat{Y}_{t+2|t} - \mu) &= \phi_1[\phi_1(Y_t - \mu) + \phi_2(Y_{t-1} - \mu) + \cdots + \phi_p(Y_{t-p+1} - \mu)] \\
&\quad + \phi_2(Y_t - \mu) + \phi_3(Y_{t-1} - \mu) + \cdots + \phi_p(Y_{t-p+2} - \mu) \\
&= (\phi_1^2 + \phi_2)(Y_t - \mu) + (\phi_1\phi_2 + \phi_3)(Y_{t-1} - \mu) + \cdots \\
&\quad + (\phi_1\phi_{p-1} + \phi_p)(Y_{t-p+2} - \mu) + \phi_1\phi_p(Y_{t-p+1} - \mu).
\end{aligned}$$

The $s$-period-ahead forecasts of an AR($p$) process can be obtained by iterating
on

$$(\hat{Y}_{t+j|t} - \mu) = \phi_1(\hat{Y}_{t+j-1|t} - \mu) + \phi_2(\hat{Y}_{t+j-2|t} - \mu) + \cdots + \phi_p(\hat{Y}_{t+j-p|t} - \mu)$$   [4.2.27]

for $j = 1, 2, \ldots, s$, where

$$\hat{Y}_{\tau|t} = Y_\tau \quad \text{for } \tau \le t.$$
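
The recursion [4.2.27] and the closed form [4.2.22] can be checked against each other numerically. The sketch below is one possible implementation; the function names, the AR(2) coefficients, and the data are hypothetical.

```python
import numpy as np

def ar_forecasts(mu, phi, y_recent, s):
    """Iterate on [4.2.27].

    y_recent holds [Y_t, Y_{t-1}, ..., Y_{t-p+1}], most recent first.
    Returns [Yhat_{t+1|t}, ..., Yhat_{t+s|t}].
    """
    p = len(phi)
    dev = [y - mu for y in y_recent[:p]]   # deviations from the mean
    out = []
    for _ in range(s):
        nxt = sum(phi[i] * dev[i] for i in range(p))
        out.append(mu + nxt)
        dev = [nxt] + dev[:-1]             # the forecast replaces the newest value
    return out

def ar_forecast_companion(mu, phi, y_recent, s):
    """Closed form [4.2.22], using the first row of F^s."""
    p = len(phi)
    F = np.zeros((p, p))
    F[0, :] = phi
    F[1:, :-1] = np.eye(p - 1)             # ones on the subdiagonal
    Fs = np.linalg.matrix_power(F, s)
    return mu + Fs[0, :] @ (np.array(y_recent[:p]) - mu)

# Hypothetical AR(2): phi = (0.6, 0.2), mu = 1, Y_t = 2.0, Y_{t-1} = 0.5.
forecasts = ar_forecasts(1.0, [0.6, 0.2], [2.0, 0.5], 3)
```

For these values the two-period-ahead deviation is $(\phi_1^2 + \phi_2)(1.0) + \phi_1\phi_2(-0.5) = 0.5$, matching the expression derived above.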


Forecasting an MA(1) Process 
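Next consider an invertible MA(1) representation,

$$Y_t - \mu = (1 + \theta L)\varepsilon_t$$   [4.2.28]

with $|\theta| < 1$. Replacing $\psi(L)$ in the Wiener-Kolmogorov formula [4.2.16] with
$(1 + \theta L)$ gives

$$\hat{Y}_{t+s|t} = \mu + \left[\frac{1 + \theta L}{L^s}\right]_+ \frac{1}{1 + \theta L}(Y_t - \mu).$$   [4.2.29]

To forecast an MA(1) process one period into the future ($s = 1$),

$$\left[\frac{1 + \theta L}{L^1}\right]_+ = \theta,$$

and so

$$\hat{Y}_{t+1|t} = \mu + \frac{\theta}{1 + \theta L}(Y_t - \mu)
= \mu + \theta(Y_t - \mu) - \theta^2(Y_{t-1} - \mu) + \theta^3(Y_{t-2} - \mu) - \cdots.$$   [4.2.30]

It is sometimes useful to write [4.2.28] as

$$\varepsilon_t = \frac{1}{1 + \theta L}(Y_t - \mu)$$

and view $\varepsilon_t$ as the outcome of an infinite recursion,

$$\hat{\varepsilon}_t = (Y_t - \mu) - \theta\hat{\varepsilon}_{t-1}.$$   [4.2.31]

The one-period-ahead forecast [4.2.30] could then be written as

$$\hat{Y}_{t+1|t} = \mu + \theta\hat{\varepsilon}_t.$$   [4.2.32]

Equation [4.2.31] is in fact an exact characterization of $\varepsilon_t$, deduced from
simple rearrangement of [4.2.28]. The "hat" notation ($\hat{\varepsilon}_t$) is introduced at this point
in anticipation of the approximations to $\varepsilon_t$ that will be introduced in the following
section and substituted into [4.2.31] and [4.2.32].

To forecast an MA(1) process for $s = 2, 3, \ldots$ periods into the future,

$$\left[\frac{1 + \theta L}{L^s}\right]_+ = 0 \quad \text{for } s = 2, 3, \ldots;$$

and so, from [4.2.29],

$$\hat{Y}_{t+s|t} = \mu \quad \text{for } s = 2, 3, \ldots.$$   [4.2.33]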


Forecasting an MA(q) Process 
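For an invertible MA($q$) process,

$$(Y_t - \mu) = (1 + \theta_1 L + \theta_2 L^2 + \cdots + \theta_q L^q)\varepsilon_t,$$

the forecast [4.2.16] becomes

$$\hat{Y}_{t+s|t} = \mu + \left[\frac{1 + \theta_1 L + \theta_2 L^2 + \cdots + \theta_q L^q}{L^s}\right]_+ \frac{1}{1 + \theta_1 L + \theta_2 L^2 + \cdots + \theta_q L^q}(Y_t - \mu).$$   [4.2.34]

Now

$$\left[\frac{1 + \theta_1 L + \theta_2 L^2 + \cdots + \theta_q L^q}{L^s}\right]_+ =
\begin{cases}
\theta_s + \theta_{s+1}L + \theta_{s+2}L^2 + \cdots + \theta_q L^{q-s} & \text{for } s = 1, 2, \ldots, q \\
0 & \text{for } s = q+1, q+2, \ldots.
\end{cases}$$

Thus, for horizons of $s = 1, 2, \ldots, q$, the forecast is given by

$$\hat{Y}_{t+s|t} = \mu + (\theta_s + \theta_{s+1}L + \theta_{s+2}L^2 + \cdots + \theta_q L^{q-s})\hat{\varepsilon}_t,$$   [4.2.35]

where $\hat{\varepsilon}_t$ can be characterized by the recursion

$$\hat{\varepsilon}_t = (Y_t - \mu) - \theta_1\hat{\varepsilon}_{t-1} - \theta_2\hat{\varepsilon}_{t-2} - \cdots - \theta_q\hat{\varepsilon}_{t-q}.$$   [4.2.36]

A forecast farther than $q$ periods into the future is simply the unconditional
mean $\mu$.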


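Forecasting an ARMA(1, 1) Process

For an ARMA(1, 1) process

$$(1 - \phi L)(Y_t - \mu) = (1 + \theta L)\varepsilon_t$$

that is stationary ($|\phi| < 1$) and invertible ($|\theta| < 1$),

$$\hat{Y}_{t+s|t} = \mu + \left[\frac{1 + \theta L}{(1 - \phi L)L^s}\right]_+ \frac{1 - \phi L}{1 + \theta L}(Y_t - \mu).$$   [4.2.37]

Here

$$\begin{aligned}
\left[\frac{1 + \theta L}{(1 - \phi L)L^s}\right]_+
&= \left[\frac{(1 + \phi L + \phi^2 L^2 + \cdots)}{L^s} + \frac{\theta L(1 + \phi L + \phi^2 L^2 + \cdots)}{L^s}\right]_+ \\
&= (\phi^s + \phi^{s+1}L + \phi^{s+2}L^2 + \cdots) + \theta(\phi^{s-1} + \phi^s L + \phi^{s+1}L^2 + \cdots) \\
&= (\phi^s + \theta\phi^{s-1})(1 + \phi L + \phi^2 L^2 + \cdots) \\
&= \frac{\phi^s + \theta\phi^{s-1}}{1 - \phi L}.
\end{aligned}$$   [4.2.38]

Substituting [4.2.38] into [4.2.37] gives

$$\hat{Y}_{t+s|t} = \mu + \left[\frac{\phi^s + \theta\phi^{s-1}}{1 - \phi L}\right]\frac{1 - \phi L}{1 + \theta L}(Y_t - \mu)
= \mu + \frac{\phi^s + \theta\phi^{s-1}}{1 + \theta L}(Y_t - \mu).$$   [4.2.39]

Note that for $s = 2, 3, \ldots$, the forecast [4.2.39] obeys the recursion

$$(\hat{Y}_{t+s|t} - \mu) = \phi(\hat{Y}_{t+s-1|t} - \mu).$$

Thus, beyond one period, the forecast decays geometrically at the rate $\phi$ toward
the unconditional mean $\mu$. The one-period-ahead forecast ($s = 1$) is given by

$$\hat{Y}_{t+1|t} = \mu + \frac{\phi + \theta}{1 + \theta L}(Y_t - \mu).$$   [4.2.40]

This can equivalently be written

$$(\hat{Y}_{t+1|t} - \mu) = \frac{\phi(1 + \theta L) + \theta(1 - \phi L)}{1 + \theta L}(Y_t - \mu) = \phi(Y_t - \mu) + \theta\hat{\varepsilon}_t,$$   [4.2.41]

where

$$\hat{\varepsilon}_t = \frac{(1 - \phi L)}{(1 + \theta L)}(Y_t - \mu)$$

or

$$\hat{\varepsilon}_t = (Y_t - \mu) - \phi(Y_{t-1} - \mu) - \theta\hat{\varepsilon}_{t-1} = Y_t - \hat{Y}_{t|t-1}.$$   [4.2.42]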


Forecasting an ARMA(p, q) Process 
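Finally, consider forecasting a stationary and invertible ARMA($p$, $q$) process:

$$(1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p)(Y_t - \mu) = (1 + \theta_1 L + \theta_2 L^2 + \cdots + \theta_q L^q)\varepsilon_t.$$

The natural generalizations of [4.2.41] and [4.2.42] are

$$(\hat{Y}_{t+1|t} - \mu) = \phi_1(Y_t - \mu) + \phi_2(Y_{t-1} - \mu) + \cdots + \phi_p(Y_{t-p+1} - \mu)
+ \theta_1\hat{\varepsilon}_t + \theta_2\hat{\varepsilon}_{t-1} + \cdots + \theta_q\hat{\varepsilon}_{t-q+1},$$   [4.2.43]

with $\{\hat{\varepsilon}_t\}$ generated recursively from

$$\hat{\varepsilon}_t = Y_t - \hat{Y}_{t|t-1}.$$   [4.2.44]

The $s$-period-ahead forecasts would be

$$(\hat{Y}_{t+s|t} - \mu) =
\begin{cases}
\phi_1(\hat{Y}_{t+s-1|t} - \mu) + \phi_2(\hat{Y}_{t+s-2|t} - \mu) + \cdots + \phi_p(\hat{Y}_{t+s-p|t} - \mu) \\
\quad + \theta_s\hat{\varepsilon}_t + \theta_{s+1}\hat{\varepsilon}_{t-1} + \cdots + \theta_q\hat{\varepsilon}_{t+s-q} & \text{for } s = 1, 2, \ldots, q \\
\phi_1(\hat{Y}_{t+s-1|t} - \mu) + \phi_2(\hat{Y}_{t+s-2|t} - \mu) + \cdots + \phi_p(\hat{Y}_{t+s-p|t} - \mu) & \text{for } s = q+1, q+2, \ldots,
\end{cases}$$   [4.2.45]

where

$$\hat{Y}_{\tau|t} = Y_\tau \quad \text{for } \tau \le t.$$

Thus for a forecast horizon $s$ greater than the moving average order $q$, the forecasts
follow a $p$th-order difference equation governed solely by the autoregressive
parameters.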
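
The recursions [4.2.43] through [4.2.45] translate directly into code once a starting rule for $\hat{\varepsilon}_t$ is chosen. The sketch below makes an assumption that is not part of the formulas themselves: with only a finite sample available, it sets $\hat{\varepsilon}$ to zero for the first $p$ dates. The function name and the numerical values are hypothetical.

```python
def arma_forecasts(y, mu, phi, theta, s_max):
    """Forecasts Yhat_{t+1|t}, ..., Yhat_{t+s_max|t} for an ARMA(p, q).

    y is in chronological order and should contain more than max(p, q) values.
    Assumption: eps_hat is set to zero for the first p dates.
    """
    p, q = len(phi), len(theta)
    dev = [v - mu for v in y]
    eps = [0.0] * len(dev)
    for t in range(p, len(dev)):
        ar = sum(phi[i] * dev[t - 1 - i] for i in range(p))
        ma = sum(theta[j] * eps[t - 1 - j] for j in range(q) if t - 1 - j >= 0)
        eps[t] = dev[t] - ar - ma          # [4.2.44]: Y_t minus its one-step forecast
    T = len(dev)
    path = list(dev)                        # observed deviations, then forecasts
    for s in range(1, s_max + 1):
        ar = sum(phi[i] * path[T + s - 2 - i] for i in range(p))
        # theta_j multiplies eps_hat_{t+s-j}, which is available only for j >= s
        ma = sum(theta[j - 1] * eps[T - 1 + s - j] for j in range(s, q + 1))
        path.append(ar + ma)
    return [mu + v for v in path[T:]]

# Hypothetical ARMA(1, 1): phi = 0.5, theta = 0.4, mu = 0, two observations.
f = arma_forecasts([1.0, 2.0], 0.0, [0.5], [0.4], 3)
```

In this example the forecasts beyond one period decay at rate $\phi$, as the recursion following [4.2.39] requires.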




4.3. Forecasts Based on a Finite Number 
of Observations 


The formulas in the preceding section assumed that we had an infinite number of
past observations on $Y$, $\{Y_t, Y_{t-1}, \ldots\}$, and knew with certainty population
parameters such as $\mu$, $\phi$, and $\theta$. This section continues to assume that population
parameters are known with certainty, but develops forecasts based on a finite
number of observations $\{Y_t, Y_{t-1}, \ldots, Y_{t-m+1}\}$.

For forecasting an AR($p$) process, an optimal $s$-period-ahead linear forecast
based on an infinite number of observations $\{Y_t, Y_{t-1}, \ldots\}$ in fact makes use of
only the $p$ most recent values $\{Y_t, Y_{t-1}, \ldots, Y_{t-p+1}\}$. For an MA or ARMA
process, however, we would in principle require all of the historical values of $Y$ in
order to implement the formulas of the preceding section.


Approximations to Optimal Forecasts 


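One approach to forecasting based on a finite number of observations is to
act as if presample ε's were all equal to zero. The idea is thus to use the
approximation

$$\hat{E}(Y_{t+s} \mid Y_t, Y_{t-1}, \ldots) \cong \hat{E}(Y_{t+s} \mid Y_t, Y_{t-1}, \ldots, Y_{t-m+1}, \varepsilon_{t-m} = 0, \varepsilon_{t-m-1} = 0, \ldots).$$   [4.3.1]

For example, consider forecasting an MA($q$) process. The recursion [4.2.36] can
be started by setting

$$\hat{\varepsilon}_{t-m} = \hat{\varepsilon}_{t-m-1} = \cdots = \hat{\varepsilon}_{t-m-q+1} = 0$$   [4.3.2]

and then iterating on [4.2.36] to generate $\hat{\varepsilon}_{t-m+1}, \hat{\varepsilon}_{t-m+2}, \ldots, \hat{\varepsilon}_t$. These
calculations produce

$$\begin{aligned}
\hat{\varepsilon}_{t-m+1} &= (Y_{t-m+1} - \mu), \\
\hat{\varepsilon}_{t-m+2} &= (Y_{t-m+2} - \mu) - \theta_1\hat{\varepsilon}_{t-m+1}, \\
\hat{\varepsilon}_{t-m+3} &= (Y_{t-m+3} - \mu) - \theta_1\hat{\varepsilon}_{t-m+2} - \theta_2\hat{\varepsilon}_{t-m+1},
\end{aligned}$$

and so on. The resulting values for $(\hat{\varepsilon}_t, \hat{\varepsilon}_{t-1}, \ldots, \hat{\varepsilon}_{t-q+s})$ are then substituted
directly into [4.2.35] to produce the forecast [4.3.1]. For example, for $s = q = 1$,
the forecast would be

$$\hat{Y}_{t+1|t} = \mu + \theta(Y_t - \mu) - \theta^2(Y_{t-1} - \mu) + \theta^3(Y_{t-2} - \mu) - \cdots + (-1)^{m-1}\theta^m(Y_{t-m+1} - \mu),$$   [4.3.3]

which is to be used as an approximation to the AR(∞) forecast,

$$\mu + \theta(Y_t - \mu) - \theta^2(Y_{t-1} - \mu) + \theta^3(Y_{t-2} - \mu) - \cdots.$$   [4.3.4]

For $m$ large and $|\theta|$ small, this clearly gives an excellent approximation. For
$|\theta|$ closer to unity, the approximation may be poorer. Note that if the moving
average operator is noninvertible, the forecast [4.3.1] is inappropriate and should
not be used.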
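
The approximation can be sketched in a few lines: start the recursion [4.2.36] from zero presample values as in [4.3.2], then apply [4.2.35]. The function names and the sample values below are hypothetical, and the code presumes an invertible MA($q$).

```python
def ma_residuals(y, mu, theta):
    """Iterate [4.2.36] with presample eps_hat set to zero, as in [4.3.2].

    y is in chronological order: [Y_{t-m+1}, ..., Y_{t-1}, Y_t].
    """
    q = len(theta)
    eps = []
    for k, y_k in enumerate(y):
        # theta_1 eps_hat_{k-1} + ... + theta_q eps_hat_{k-q}; presample terms are zero
        past = sum(theta[j] * eps[k - 1 - j] for j in range(q) if k - 1 - j >= 0)
        eps.append((y_k - mu) - past)
    return eps

def ma_forecast_approx(y, mu, theta, s):
    """Approximate s-period-ahead forecast [4.3.1] for an invertible MA(q)."""
    q = len(theta)
    if s > q:
        return mu
    eps = ma_residuals(y, mu, theta)
    last = len(eps) - 1
    # [4.2.35]: theta_s eps_hat_t + theta_{s+1} eps_hat_{t-1} + ... + theta_q eps_hat_{t-q+s}
    return mu + sum(theta[j - 1] * eps[last - (j - s)]
                    for j in range(s, q + 1) if last - (j - s) >= 0)

# Hypothetical MA(1) with theta = 0.5, mu = 0 and m = 3 observations.
approx = ma_forecast_approx([1.0, 2.0, 3.0], 0.0, [0.5], 1)
```

With these values [4.3.3] gives $0.5(3) - 0.25(2) + 0.125(1) = 1.125$, which is what the recursion returns.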




Exact Finite-Sample Forecasts 


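An alternative approach is to calculate the exact projection of $Y_{t+1}$ on its $m$
most recent values. Let

$$\mathbf{X}_t = \begin{bmatrix} 1 \\ Y_t \\ Y_{t-1} \\ \vdots \\ Y_{t-m+1} \end{bmatrix}.$$

We thus seek a linear forecast of the form

$$\boldsymbol{\alpha}^{(m)\prime}\mathbf{X}_t = \alpha_0^{(m)} + \alpha_1^{(m)}Y_t + \alpha_2^{(m)}Y_{t-1} + \cdots + \alpha_m^{(m)}Y_{t-m+1}.$$   [4.3.5]

The coefficient relating $Y_{t+1}$ to $Y_t$ in a projection of $Y_{t+1}$ on the $m$ most recent
values of $Y$ is denoted $\alpha_1^{(m)}$ in [4.3.5]. This will in general be different from the
coefficient relating $Y_{t+1}$ to $Y_t$ in a projection of $Y_{t+1}$ on the $m + 1$ most recent
values of $Y$; the latter coefficient would be denoted $\alpha_1^{(m+1)}$.

If $Y_t$ is covariance-stationary, then $E(Y_t Y_{t-j}) = \gamma_j + \mu^2$. Setting
$\mathbf{X}_t = (1, Y_t, Y_{t-1}, \ldots, Y_{t-m+1})'$ in [4.1.13] implies

$$\begin{aligned}
\boldsymbol{\alpha}^{(m)\prime} &\equiv \begin{bmatrix} \alpha_0^{(m)} & \alpha_1^{(m)} & \alpha_2^{(m)} & \cdots & \alpha_m^{(m)} \end{bmatrix} \\
&= \begin{bmatrix} \mu & (\gamma_1 + \mu^2) & (\gamma_2 + \mu^2) & \cdots & (\gamma_m + \mu^2) \end{bmatrix}
\begin{bmatrix}
1 & \mu & \mu & \cdots & \mu \\
\mu & \gamma_0 + \mu^2 & \gamma_1 + \mu^2 & \cdots & \gamma_{m-1} + \mu^2 \\
\mu & \gamma_1 + \mu^2 & \gamma_0 + \mu^2 & \cdots & \gamma_{m-2} + \mu^2 \\
\vdots & \vdots & \vdots & & \vdots \\
\mu & \gamma_{m-1} + \mu^2 & \gamma_{m-2} + \mu^2 & \cdots & \gamma_0 + \mu^2
\end{bmatrix}^{-1}.
\end{aligned}$$   [4.3.6]

When a constant term is included in $\mathbf{X}_t$, it is more convenient to express
variables in deviations from the mean. Then we could calculate the projection of
$(Y_{t+1} - \mu)$ on $\mathbf{X}_t = [(Y_t - \mu), (Y_{t-1} - \mu), \ldots, (Y_{t-m+1} - \mu)]'$:

$$\hat{Y}_{t+1|t} - \mu = \alpha_1^{(m)}(Y_t - \mu) + \alpha_2^{(m)}(Y_{t-1} - \mu) + \cdots + \alpha_m^{(m)}(Y_{t-m+1} - \mu).$$   [4.3.7]

For this definition of $\mathbf{X}_t$ the coefficients can be calculated directly from [4.1.13] to
be

$$\begin{bmatrix} \alpha_1^{(m)} \\ \alpha_2^{(m)} \\ \vdots \\ \alpha_m^{(m)} \end{bmatrix} =
\begin{bmatrix}
\gamma_0 & \gamma_1 & \cdots & \gamma_{m-1} \\
\gamma_1 & \gamma_0 & \cdots & \gamma_{m-2} \\
\vdots & \vdots & & \vdots \\
\gamma_{m-1} & \gamma_{m-2} & \cdots & \gamma_0
\end{bmatrix}^{-1}
\begin{bmatrix} \gamma_1 \\ \gamma_2 \\ \vdots \\ \gamma_m \end{bmatrix}.$$   [4.3.8]

We will demonstrate in Section 4.5 that the coefficients $(\alpha_1^{(m)}, \alpha_2^{(m)}, \ldots,
\alpha_m^{(m)})$ in equations [4.3.8] and [4.3.6] are identical. This is analogous to a familiar
result for ordinary least squares regression: slope coefficients would be unchanged
if all variables are expressed in deviations from their sample means and the constant
term is dropped from the regression.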




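To generate an $s$-period-ahead forecast $\hat{Y}_{t+s|t}$, we would use

$$\hat{Y}_{t+s|t} = \mu + \alpha_1^{(m,s)}(Y_t - \mu) + \alpha_2^{(m,s)}(Y_{t-1} - \mu) + \cdots + \alpha_m^{(m,s)}(Y_{t-m+1} - \mu),$$

where

$$\begin{bmatrix} \alpha_1^{(m,s)} \\ \alpha_2^{(m,s)} \\ \vdots \\ \alpha_m^{(m,s)} \end{bmatrix} =
\begin{bmatrix}
\gamma_0 & \gamma_1 & \cdots & \gamma_{m-1} \\
\gamma_1 & \gamma_0 & \cdots & \gamma_{m-2} \\
\vdots & \vdots & & \vdots \\
\gamma_{m-1} & \gamma_{m-2} & \cdots & \gamma_0
\end{bmatrix}^{-1}
\begin{bmatrix} \gamma_s \\ \gamma_{s+1} \\ \vdots \\ \gamma_{s+m-1} \end{bmatrix}.$$   [4.3.9]
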


Using expressions such as [4.3.8] requires inverting an (m x m) matrix. 
Several algorithms can be used to evaluate [4.3.8] using relatively simple calcula- 
tions. One approach is based on the Kalman filter discussed in Chapter 13, which 
can generate exact finite-sample forecasts for a broad class of processes including 
any ARMA specification. A second approach is based on triangular factorization 
of the matrix in [4.3.8]. This second approach is developed in the next two sections. 
This approach will prove helpful for the immediate question of calculating finite- 
sample forecasts and is also a useful device for establishing a number of later 
results. 
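
Before turning to those algorithms, it may help to see [4.3.8] and [4.3.9] evaluated by brute force. The sketch below solves the linear system with a general-purpose solver rather than either method mentioned above; the function name and the AR(1) autocovariances used as input are hypothetical.

```python
import numpy as np

def exact_projection_coeffs(gamma, m, s=1):
    """Solve [4.3.9] for (alpha_1, ..., alpha_m); s = 1 gives [4.3.8].

    gamma[j] is the j-th autocovariance, needed for j = 0, ..., m + s - 1.
    """
    G = np.array([[gamma[abs(i - j)] for j in range(m)] for i in range(m)])
    rhs = np.array([gamma[s + i] for i in range(m)])
    return np.linalg.solve(G, rhs)

# Hypothetical check with AR(1) autocovariances, phi = 0.5 and sigma^2 = 1:
# gamma_j = phi^j / (1 - phi^2).
gamma = [0.5 ** j / 0.75 for j in range(5)]
alpha = exact_projection_coeffs(gamma, m=3, s=2)
```

For an AR(1) only the most recent observation matters, so the result should be $(\phi^2, 0, 0)$, consistent with [4.2.19].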


4.4. The Triangular Factorization of a Positive Definite 
Symmetric Matrix 


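Any positive definite symmetric $(n \times n)$ matrix $\boldsymbol{\Omega}$ has a unique representation of
the form

$$\boldsymbol{\Omega} = \mathbf{A}\mathbf{D}\mathbf{A}',$$   [4.4.1]

where $\mathbf{A}$ is a lower triangular matrix with 1s along the principal diagonal,

$$\mathbf{A} = \begin{bmatrix}
1 & 0 & 0 & \cdots & 0 \\
a_{21} & 1 & 0 & \cdots & 0 \\
a_{31} & a_{32} & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
a_{n1} & a_{n2} & a_{n3} & \cdots & 1
\end{bmatrix},$$

and $\mathbf{D}$ is a diagonal matrix,

$$\mathbf{D} = \begin{bmatrix}
d_{11} & 0 & 0 & \cdots & 0 \\
0 & d_{22} & 0 & \cdots & 0 \\
0 & 0 & d_{33} & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
0 & 0 & 0 & \cdots & d_{nn}
\end{bmatrix},$$

where $d_{ii} > 0$ for all $i$. This is known as the triangular factorization of $\boldsymbol{\Omega}$.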




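To see how the triangular factorization can be calculated, consider

$$\boldsymbol{\Omega} = \begin{bmatrix}
\Omega_{11} & \Omega_{12} & \Omega_{13} & \cdots & \Omega_{1n} \\
\Omega_{21} & \Omega_{22} & \Omega_{23} & \cdots & \Omega_{2n} \\
\Omega_{31} & \Omega_{32} & \Omega_{33} & \cdots & \Omega_{3n} \\
\vdots & \vdots & \vdots & & \vdots \\
\Omega_{n1} & \Omega_{n2} & \Omega_{n3} & \cdots & \Omega_{nn}
\end{bmatrix}.$$   [4.4.2]

We assume that $\boldsymbol{\Omega}$ is positive definite, meaning that $\mathbf{x}'\boldsymbol{\Omega}\mathbf{x} > 0$ for any nonzero
$(n \times 1)$ vector $\mathbf{x}$. We also assume that $\boldsymbol{\Omega}$ is symmetric, so that $\Omega_{ij} = \Omega_{ji}$.

The matrix $\boldsymbol{\Omega}$ can be transformed into a matrix with zero in the (2, 1) position
by multiplying the first row of $\boldsymbol{\Omega}$ by $\Omega_{21}\Omega_{11}^{-1}$ and subtracting the resulting row from
the second. A zero can be put in the (3, 1) position by multiplying the first row
by $\Omega_{31}\Omega_{11}^{-1}$ and subtracting the resulting row from the third. We proceed in this
fashion down the first column. This set of operations can be summarized as
premultiplying $\boldsymbol{\Omega}$ by the following matrix:

$$\mathbf{E}_1 = \begin{bmatrix}
1 & 0 & 0 & \cdots & 0 \\
-\Omega_{21}\Omega_{11}^{-1} & 1 & 0 & \cdots & 0 \\
-\Omega_{31}\Omega_{11}^{-1} & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
-\Omega_{n1}\Omega_{11}^{-1} & 0 & 0 & \cdots & 1
\end{bmatrix}.$$   [4.4.3]

This matrix always exists, provided that $\Omega_{11} \ne 0$. This is ensured in the present
case, because $\Omega_{11}$ is equal to $\mathbf{e}_1'\boldsymbol{\Omega}\mathbf{e}_1$, where $\mathbf{e}_1' = [1 \; 0 \; 0 \; \cdots \; 0]$. Since $\boldsymbol{\Omega}$ is positive
definite, $\mathbf{e}_1'\boldsymbol{\Omega}\mathbf{e}_1$ must be greater than zero.

When $\boldsymbol{\Omega}$ is premultiplied by $\mathbf{E}_1$ and postmultiplied by $\mathbf{E}_1'$ the result is

$$\mathbf{E}_1\boldsymbol{\Omega}\mathbf{E}_1' = \mathbf{H},$$   [4.4.4]

where

$$\begin{aligned}
\mathbf{H} &= \begin{bmatrix}
h_{11} & 0 & 0 & \cdots & 0 \\
0 & h_{22} & h_{23} & \cdots & h_{2n} \\
0 & h_{32} & h_{33} & \cdots & h_{3n} \\
\vdots & \vdots & \vdots & & \vdots \\
0 & h_{n2} & h_{n3} & \cdots & h_{nn}
\end{bmatrix} \\
&= \begin{bmatrix}
\Omega_{11} & 0 & 0 & \cdots & 0 \\
0 & \Omega_{22} - \Omega_{21}\Omega_{11}^{-1}\Omega_{12} & \Omega_{23} - \Omega_{21}\Omega_{11}^{-1}\Omega_{13} & \cdots & \Omega_{2n} - \Omega_{21}\Omega_{11}^{-1}\Omega_{1n} \\
0 & \Omega_{32} - \Omega_{31}\Omega_{11}^{-1}\Omega_{12} & \Omega_{33} - \Omega_{31}\Omega_{11}^{-1}\Omega_{13} & \cdots & \Omega_{3n} - \Omega_{31}\Omega_{11}^{-1}\Omega_{1n} \\
\vdots & \vdots & \vdots & & \vdots \\
0 & \Omega_{n2} - \Omega_{n1}\Omega_{11}^{-1}\Omega_{12} & \Omega_{n3} - \Omega_{n1}\Omega_{11}^{-1}\Omega_{13} & \cdots & \Omega_{nn} - \Omega_{n1}\Omega_{11}^{-1}\Omega_{1n}
\end{bmatrix}.
\end{aligned}$$   [4.4.5]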

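We next proceed in exactly the same way with the second column of $\mathbf{H}$. The
approach now will be to multiply the second row of $\mathbf{H}$ by $h_{32}h_{22}^{-1}$ and subtract the
result from the third row. Similarly, we multiply the second row of $\mathbf{H}$ by $h_{42}h_{22}^{-1}$
and subtract the result from the fourth row, and so on down through the second
column of $\mathbf{H}$. These operations can be represented as premultiplying $\mathbf{H}$ by the
following matrix:

$$\mathbf{E}_2 = \begin{bmatrix}
1 & 0 & 0 & \cdots & 0 \\
0 & 1 & 0 & \cdots & 0 \\
0 & -h_{32}h_{22}^{-1} & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
0 & -h_{n2}h_{22}^{-1} & 0 & \cdots & 1
\end{bmatrix}.$$   [4.4.6]

This matrix always exists provided that $h_{22} \ne 0$. But $h_{22}$ can be calculated as
$h_{22} = \mathbf{e}_2'\mathbf{H}\mathbf{e}_2$, where $\mathbf{e}_2' = [0 \; 1 \; 0 \; \cdots \; 0]$. Moreover, $\mathbf{H} = \mathbf{E}_1\boldsymbol{\Omega}\mathbf{E}_1'$, where $\boldsymbol{\Omega}$ is
positive definite and $\mathbf{E}_1$ is given by [4.4.3]. Since $\mathbf{E}_1$ is lower triangular, its
determinant is the product of terms along the principal diagonal, which are all unity.
Thus $\mathbf{E}_1$ is nonsingular, meaning that $\mathbf{H} = \mathbf{E}_1\boldsymbol{\Omega}\mathbf{E}_1'$ is positive definite and so $h_{22} =
\mathbf{e}_2'\mathbf{H}\mathbf{e}_2$ must be strictly positive. Thus the matrix in [4.4.6] can always be calculated.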

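If $\mathbf{H}$ is premultiplied by the matrix in [4.4.6] and postmultiplied by the
transpose, the result is

$$\mathbf{E}_2\mathbf{H}\mathbf{E}_2' = \mathbf{K},$$

where

$$\mathbf{K} = \begin{bmatrix}
h_{11} & 0 & 0 & \cdots & 0 \\
0 & h_{22} & 0 & \cdots & 0 \\
0 & 0 & h_{33} - h_{32}h_{22}^{-1}h_{23} & \cdots & h_{3n} - h_{32}h_{22}^{-1}h_{2n} \\
\vdots & \vdots & \vdots & & \vdots \\
0 & 0 & h_{n3} - h_{n2}h_{22}^{-1}h_{23} & \cdots & h_{nn} - h_{n2}h_{22}^{-1}h_{2n}
\end{bmatrix}.$$

Again, since $\mathbf{H}$ is positive definite and since $\mathbf{E}_2$ is nonsingular, $\mathbf{K}$ is positive
definite and in particular $k_{33}$ is positive. Proceeding through each of the columns
with the same approach, we see that for any positive definite symmetric matrix $\boldsymbol{\Omega}$
there exist matrices $\mathbf{E}_1, \mathbf{E}_2, \ldots, \mathbf{E}_{n-1}$ such that

$$\mathbf{E}_{n-1} \cdots \mathbf{E}_2\mathbf{E}_1 \boldsymbol{\Omega} \mathbf{E}_1'\mathbf{E}_2' \cdots \mathbf{E}_{n-1}' = \mathbf{D},$$   [4.4.7]

where

$$\mathbf{D} = \begin{bmatrix}
\Omega_{11} & 0 & 0 & \cdots & 0 \\
0 & \Omega_{22} - \Omega_{21}\Omega_{11}^{-1}\Omega_{12} & 0 & \cdots & 0 \\
0 & 0 & h_{33} - h_{32}h_{22}^{-1}h_{23} & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
0 & 0 & 0 & \cdots & c_{nn} - c_{n,n-1}c_{n-1,n-1}^{-1}c_{n-1,n}
\end{bmatrix},$$

with all the diagonal entries of $\mathbf{D}$ strictly positive. The matrices $\mathbf{E}_1$ and $\mathbf{E}_2$ in [4.4.7]
are given by [4.4.3] and [4.4.6]. In general, $\mathbf{E}_j$ is a matrix with nonzero values in
the $j$th column below the principal diagonal, 1s along the principal diagonal, and
zeros everywhere else.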

Thus each $\mathbf{E}_j$ is lower triangular with unit determinant. Hence $\mathbf{E}_j^{-1}$ exists,
and the following matrix exists:

$$\mathbf{A} \equiv (\mathbf{E}_{n-1} \cdots \mathbf{E}_2\mathbf{E}_1)^{-1} = \mathbf{E}_1^{-1}\mathbf{E}_2^{-1} \cdots \mathbf{E}_{n-1}^{-1}.$$   [4.4.8]

If [4.4.7] is premultiplied by $\mathbf{A}$ and postmultiplied by $\mathbf{A}'$, the result is

$$\boldsymbol{\Omega} = \mathbf{A}\mathbf{D}\mathbf{A}'.$$   [4.4.9]


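Recall that $\mathbf{E}_1$ represents the operation of multiplying the first row of $\boldsymbol{\Omega}$ by
certain numbers and subtracting the results from each of the subsequent rows. Its
inverse $\mathbf{E}_1^{-1}$ undoes this operation, which would be achieved by multiplying the
first row by these same numbers and adding the results to the subsequent rows.
Thus

$$\mathbf{E}_1^{-1} = \begin{bmatrix}
1 & 0 & 0 & \cdots & 0 \\
\Omega_{21}\Omega_{11}^{-1} & 1 & 0 & \cdots & 0 \\
\Omega_{31}\Omega_{11}^{-1} & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
\Omega_{n1}\Omega_{11}^{-1} & 0 & 0 & \cdots & 1
\end{bmatrix},$$   [4.4.10]

as may be verified directly by multiplying [4.4.3] by [4.4.10] to obtain the identity
matrix. Similarly,

$$\mathbf{E}_2^{-1} = \begin{bmatrix}
1 & 0 & 0 & \cdots & 0 \\
0 & 1 & 0 & \cdots & 0 \\
0 & h_{32}h_{22}^{-1} & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
0 & h_{n2}h_{22}^{-1} & 0 & \cdots & 1
\end{bmatrix},$$

and so on. Because of this special structure, the series of multiplications in [4.4.8]
turns out to be trivial to carry out:

$$\mathbf{A} = \begin{bmatrix}
1 & 0 & 0 & \cdots & 0 \\
\Omega_{21}\Omega_{11}^{-1} & 1 & 0 & \cdots & 0 \\
\Omega_{31}\Omega_{11}^{-1} & h_{32}h_{22}^{-1} & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
\Omega_{n1}\Omega_{11}^{-1} & h_{n2}h_{22}^{-1} & k_{n3}k_{33}^{-1} & \cdots & 1
\end{bmatrix}.$$   [4.4.11]

That is, the $j$th column of $\mathbf{A}$ is just the $j$th column of $\mathbf{E}_j^{-1}$.

We should emphasize that the simplicity of carrying out these matrix
multiplications is due not just to the special structure of the $\mathbf{E}_j^{-1}$ matrices but also to
the order in which they are multiplied. For example, $\mathbf{A}^{-1} = \mathbf{E}_{n-1}\mathbf{E}_{n-2} \cdots \mathbf{E}_1$
cannot be calculated simply by using the $j$th column of $\mathbf{E}_j$ for the $j$th column of
$\mathbf{A}^{-1}$.

Since the matrix $\mathbf{A}$ in [4.4.11] is lower triangular with 1s along the principal
diagonal, expression [4.4.9] is the triangular factorization of $\boldsymbol{\Omega}$.
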
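For illustration, the triangular factorization $\boldsymbol{\Omega} = \mathbf{A}\mathbf{D}\mathbf{A}'$ of a $(2 \times 2)$ matrix
is

$$\begin{bmatrix} \Omega_{11} & \Omega_{12} \\ \Omega_{21} & \Omega_{22} \end{bmatrix} =
\begin{bmatrix} 1 & 0 \\ \Omega_{21}\Omega_{11}^{-1} & 1 \end{bmatrix}
\begin{bmatrix} \Omega_{11} & 0 \\ 0 & \Omega_{22} - \Omega_{21}\Omega_{11}^{-1}\Omega_{12} \end{bmatrix}
\begin{bmatrix} 1 & \Omega_{11}^{-1}\Omega_{12} \\ 0 & 1 \end{bmatrix},$$   [4.4.12]

while that of a $(3 \times 3)$ matrix is

$$\begin{bmatrix} \Omega_{11} & \Omega_{12} & \Omega_{13} \\ \Omega_{21} & \Omega_{22} & \Omega_{23} \\ \Omega_{31} & \Omega_{32} & \Omega_{33} \end{bmatrix} =
\begin{bmatrix} 1 & 0 & 0 \\ \Omega_{21}\Omega_{11}^{-1} & 1 & 0 \\ \Omega_{31}\Omega_{11}^{-1} & h_{32}h_{22}^{-1} & 1 \end{bmatrix}
\begin{bmatrix} \Omega_{11} & 0 & 0 \\ 0 & h_{22} & 0 \\ 0 & 0 & h_{33} - h_{32}h_{22}^{-1}h_{23} \end{bmatrix}
\begin{bmatrix} 1 & \Omega_{11}^{-1}\Omega_{12} & \Omega_{11}^{-1}\Omega_{13} \\ 0 & 1 & h_{22}^{-1}h_{23} \\ 0 & 0 & 1 \end{bmatrix},$$   [4.4.13]

where $h_{22} = (\Omega_{22} - \Omega_{21}\Omega_{11}^{-1}\Omega_{12})$, $h_{33} = (\Omega_{33} - \Omega_{31}\Omega_{11}^{-1}\Omega_{13})$, and $h_{23} = h_{32} =
(\Omega_{23} - \Omega_{21}\Omega_{11}^{-1}\Omega_{13})$.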
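
The column-by-column elimination described above can be sketched as follows. This is an illustrative implementation that forms each $\mathbf{E}_j$ explicitly to mirror the derivation; it is not meant to be numerically efficient, and the function name and test matrix are hypothetical.

```python
import numpy as np

def triangular_factorization(omega):
    """Return (A, D) with omega = A D A', A unit lower triangular, D diagonal.

    omega is assumed to be symmetric and positive definite.
    """
    H = np.asarray(omega, dtype=float).copy()
    n = H.shape[0]
    A = np.eye(n)
    for j in range(n - 1):
        # Multipliers for the rows below the pivot; they form column j of E_j^{-1}.
        mult = H[j + 1:, j] / H[j, j]
        A[j + 1:, j] = mult
        E = np.eye(n)
        E[j + 1:, j] = -mult
        H = E @ H @ E.T        # zero out column j and row j off the diagonal
    return A, np.diag(np.diag(H))

# Hypothetical positive definite matrix.
omega = np.array([[4.0, 2.0, 2.0],
                  [2.0, 3.0, 1.5],
                  [2.0, 1.5, 3.0]])
A, D = triangular_factorization(omega)
```

Storing the multipliers directly in column $j$ of `A` uses the fact noted above that the $j$th column of $\mathbf{A}$ is the $j$th column of $\mathbf{E}_j^{-1}$.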


Uniqueness of the Triangular Factorization 


We next establish that the triangular factorization is unique. Suppose that

$$\boldsymbol{\Omega} = \mathbf{A}_1\mathbf{D}_1\mathbf{A}_1' = \mathbf{A}_2\mathbf{D}_2\mathbf{A}_2',$$   [4.4.14]

where $\mathbf{A}_1$ and $\mathbf{A}_2$ are both lower triangular with 1s along the principal diagonal and
$\mathbf{D}_1$ and $\mathbf{D}_2$ are both diagonal with positive entries along the principal diagonal.
Then all the matrices have inverses. Premultiplying [4.4.14] by $\mathbf{D}_1^{-1}\mathbf{A}_1^{-1}$ and
postmultiplying by $[\mathbf{A}_2']^{-1}$ yields

$$\mathbf{A}_1'[\mathbf{A}_2']^{-1} = \mathbf{D}_1^{-1}\mathbf{A}_1^{-1}\mathbf{A}_2\mathbf{D}_2.$$   [4.4.15]

Since $\mathbf{A}_2'$ is upper triangular with 1s along the principal diagonal, $[\mathbf{A}_2']^{-1}$ must
likewise be upper triangular with 1s along the principal diagonal. Since $\mathbf{A}_1'$ is also
of this form, the left side of [4.4.15] is upper triangular with 1s along the principal
diagonal. By similar reasoning, the right side of [4.4.15] must be lower triangular.
The only way an upper triangular matrix can equal a lower triangular matrix is if
all the off-diagonal terms are zero. Moreover, since the diagonal entries on the
left side of [4.4.15] are all unity, this matrix must be the identity matrix:

$$\mathbf{A}_1'[\mathbf{A}_2']^{-1} = \mathbf{I}_n.$$

Postmultiplication by $\mathbf{A}_2'$ establishes that $\mathbf{A}_1' = \mathbf{A}_2'$. Premultiplying [4.4.14] by $\mathbf{A}^{-1}$
and postmultiplying by $[\mathbf{A}']^{-1}$ then yields $\mathbf{D}_1 = \mathbf{D}_2$.


The Cholesky Factorization 


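A closely related factorization of a symmetric positive definite matrix $\boldsymbol{\Omega}$ is
obtained as follows. Define $\mathbf{D}^{1/2}$ to be the $(n \times n)$ diagonal matrix whose diagonal
entries are the square roots of the corresponding elements of the matrix $\mathbf{D}$ in the
triangular factorization:

$$\mathbf{D}^{1/2} = \begin{bmatrix}
\sqrt{d_{11}} & 0 & 0 & \cdots & 0 \\
0 & \sqrt{d_{22}} & 0 & \cdots & 0 \\
0 & 0 & \sqrt{d_{33}} & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
0 & 0 & 0 & \cdots & \sqrt{d_{nn}}
\end{bmatrix}.$$

Since the matrix $\mathbf{D}$ is unique and has strictly positive diagonal entries, the matrix
$\mathbf{D}^{1/2}$ exists and is unique. Then the triangular factorization can be written

$$\boldsymbol{\Omega} = \mathbf{A}\mathbf{D}^{1/2}\mathbf{D}^{1/2}\mathbf{A}' = \mathbf{A}\mathbf{D}^{1/2}(\mathbf{A}\mathbf{D}^{1/2})'$$

or

$$\boldsymbol{\Omega} = \mathbf{P}\mathbf{P}',$$   [4.4.16]

where

$$\begin{aligned}
\mathbf{P} \equiv \mathbf{A}\mathbf{D}^{1/2}
&= \begin{bmatrix}
1 & 0 & 0 & \cdots & 0 \\
a_{21} & 1 & 0 & \cdots & 0 \\
a_{31} & a_{32} & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
a_{n1} & a_{n2} & a_{n3} & \cdots & 1
\end{bmatrix}
\begin{bmatrix}
\sqrt{d_{11}} & 0 & 0 & \cdots & 0 \\
0 & \sqrt{d_{22}} & 0 & \cdots & 0 \\
0 & 0 & \sqrt{d_{33}} & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
0 & 0 & 0 & \cdots & \sqrt{d_{nn}}
\end{bmatrix} \\
&= \begin{bmatrix}
\sqrt{d_{11}} & 0 & 0 & \cdots & 0 \\
a_{21}\sqrt{d_{11}} & \sqrt{d_{22}} & 0 & \cdots & 0 \\
a_{31}\sqrt{d_{11}} & a_{32}\sqrt{d_{22}} & \sqrt{d_{33}} & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
a_{n1}\sqrt{d_{11}} & a_{n2}\sqrt{d_{22}} & a_{n3}\sqrt{d_{33}} & \cdots & \sqrt{d_{nn}}
\end{bmatrix}.
\end{aligned}$$

Expression [4.4.16] is known as the Cholesky factorization of $\boldsymbol{\Omega}$. Note that $\mathbf{P}$, like
$\mathbf{A}$, is lower triangular, though whereas $\mathbf{A}$ has 1s along the principal diagonal, the
Cholesky factor has the square roots of the elements of $\mathbf{D}$ along the principal
diagonal.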
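
The relation $\mathbf{P} = \mathbf{A}\mathbf{D}^{1/2}$ can be run in reverse: given a Cholesky factor from a numerical library, the triangular factorization is recovered by rescaling its columns. The sketch below uses NumPy's `numpy.linalg.cholesky`, which returns the lower triangular factor; the matrix is a hypothetical example.

```python
import numpy as np

# Hypothetical positive definite symmetric matrix.
omega = np.array([[4.0, 2.0, 2.0],
                  [2.0, 3.0, 1.5],
                  [2.0, 1.5, 3.0]])

P = np.linalg.cholesky(omega)   # lower triangular, omega = P P'
d_sqrt = np.diag(P)             # square roots of d_11, ..., d_nn
A = P / d_sqrt                  # divide column j by sqrt(d_jj), giving a unit diagonal
D = np.diag(d_sqrt ** 2)
```

Here `A` and `D` satisfy $\boldsymbol{\Omega} = \mathbf{A}\mathbf{D}\mathbf{A}'$, and by the uniqueness result they must coincide with the factors produced by the elimination procedure.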


4.5. Updating a Linear Projection 


Triangular Factorization of a Second-Moment Matrix 
and Linear Projection 


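Let $\mathbf{Y} = (Y_1, Y_2, \ldots, Y_n)'$ be an $(n \times 1)$ vector of random variables whose
second-moment matrix is given by

$$\boldsymbol{\Omega} = E(\mathbf{Y}\mathbf{Y}').$$   [4.5.1]

Let $\boldsymbol{\Omega} = \mathbf{A}\mathbf{D}\mathbf{A}'$ be the triangular factorization of $\boldsymbol{\Omega}$, and define

$$\tilde{\mathbf{Y}} \equiv \mathbf{A}^{-1}\mathbf{Y}.$$   [4.5.2]

The second-moment matrix of these transformed variables is given by

$$E(\tilde{\mathbf{Y}}\tilde{\mathbf{Y}}') = E(\mathbf{A}^{-1}\mathbf{Y}\mathbf{Y}'[\mathbf{A}']^{-1}) = \mathbf{A}^{-1}E(\mathbf{Y}\mathbf{Y}')[\mathbf{A}']^{-1}.$$   [4.5.3]

Substituting [4.5.1] into [4.5.3], the second-moment matrix of $\tilde{\mathbf{Y}}$ is seen to be
diagonal:

$$E(\tilde{\mathbf{Y}}\tilde{\mathbf{Y}}') = \mathbf{A}^{-1}\boldsymbol{\Omega}[\mathbf{A}']^{-1} = \mathbf{A}^{-1}\mathbf{A}\mathbf{D}\mathbf{A}'[\mathbf{A}']^{-1} = \mathbf{D}.$$   [4.5.4]

That is,

$$E(\tilde{Y}_i\tilde{Y}_j) = \begin{cases} d_{ii} & \text{for } i = j \\ 0 & \text{for } i \ne j. \end{cases}$$   [4.5.5]

Thus the $\tilde{Y}$'s form a series of random variables that are uncorrelated with
one another.⁴ To see the implication of this, premultiply [4.5.2] by $\mathbf{A}$:

$$\mathbf{A}\tilde{\mathbf{Y}} = \mathbf{Y}.$$   [4.5.6]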
⁴ We will use "$Y_i$ and $Y_j$ are uncorrelated" to mean "$E(Y_iY_j) = 0$." The terminology will be correct
if $Y_i$ and $Y_j$ have zero means or if a constant term is included in the linear projection.




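Expression [4.4.11] can be used to write out [4.5.6] explicitly as

$$\begin{bmatrix}
1 & 0 & 0 & \cdots & 0 \\
\Omega_{21}\Omega_{11}^{-1} & 1 & 0 & \cdots & 0 \\
\Omega_{31}\Omega_{11}^{-1} & h_{32}h_{22}^{-1} & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
\Omega_{n1}\Omega_{11}^{-1} & h_{n2}h_{22}^{-1} & k_{n3}k_{33}^{-1} & \cdots & 1
\end{bmatrix}
\begin{bmatrix} \tilde{Y}_1 \\ \tilde{Y}_2 \\ \tilde{Y}_3 \\ \vdots \\ \tilde{Y}_n \end{bmatrix} =
\begin{bmatrix} Y_1 \\ Y_2 \\ Y_3 \\ \vdots \\ Y_n \end{bmatrix}.$$   [4.5.7]

The first equation in [4.5.7] states that

$$\tilde{Y}_1 = Y_1,$$   [4.5.8]

so the first elements of the vectors $\tilde{\mathbf{Y}}$ and $\mathbf{Y}$ represent the same random variable.
The second equation in [4.5.7] asserts that

$$\Omega_{21}\Omega_{11}^{-1}\tilde{Y}_1 + \tilde{Y}_2 = Y_2,$$

or, using [4.5.8],

$$\tilde{Y}_2 = Y_2 - \Omega_{21}\Omega_{11}^{-1}Y_1 = Y_2 - \alpha Y_1,$$   [4.5.9]

where we have defined $\alpha \equiv \Omega_{21}\Omega_{11}^{-1}$. The fact that $\tilde{Y}_2$ is uncorrelated with $\tilde{Y}_1$
implies

$$E(\tilde{Y}_2\tilde{Y}_1) = E[(Y_2 - \alpha Y_1)Y_1] = 0.$$   [4.5.10]

But, recalling [4.1.10], the value of $\alpha$ that satisfies [4.5.10] is defined as the
coefficient of the linear projection of $Y_2$ on $Y_1$. Thus the triangular factorization of $\boldsymbol{\Omega}$
can be used to infer that the coefficient of a linear projection of $Y_2$ on $Y_1$ is given
by $\alpha = \Omega_{21}\Omega_{11}^{-1}$, confirming the earlier result [4.1.13]. In general, the row $i$, column
1 entry of $\mathbf{A}$ is $\Omega_{i1}\Omega_{11}^{-1}$, which is the coefficient from a linear projection of $Y_i$ on
$Y_1$.

Since $\tilde{Y}_2$ has the interpretation as the residual from a projection of $Y_2$ on $Y_1$,
from [4.5.5] $d_{22}$ gives the MSE of this projection:

$$E(\tilde{Y}_2^2) = d_{22} = \Omega_{22} - \Omega_{21}\Omega_{11}^{-1}\Omega_{12}.$$

This confirms the formula for the MSE of a linear projection derived earlier
(equation [4.1.15]).

The third equation in [4.5.7] states that

$$\Omega_{31}\Omega_{11}^{-1}\tilde{Y}_1 + h_{32}h_{22}^{-1}\tilde{Y}_2 + \tilde{Y}_3 = Y_3.$$

Substituting in from [4.5.8] and [4.5.9] and rearranging,

$$\tilde{Y}_3 = Y_3 - \Omega_{31}\Omega_{11}^{-1}Y_1 - h_{32}h_{22}^{-1}(Y_2 - \Omega_{21}\Omega_{11}^{-1}Y_1).$$   [4.5.11]

Thus $\tilde{Y}_3$ is the residual from subtracting a particular linear combination of $Y_1$ and
$Y_2$ from $Y_3$. From [4.5.5], this residual is uncorrelated with either $\tilde{Y}_1$ or $\tilde{Y}_2$:

$$E\{[Y_3 - \Omega_{31}\Omega_{11}^{-1}Y_1 - h_{32}h_{22}^{-1}(Y_2 - \Omega_{21}\Omega_{11}^{-1}Y_1)]\tilde{Y}_j\} = 0 \quad \text{for } j = 1 \text{ or } 2.$$

Thus this residual is uncorrelated with either $Y_1$ or $Y_2$, meaning that $\tilde{Y}_3$ has the
interpretation as the residual from a linear projection of $Y_3$ on $Y_1$ and $Y_2$. According
to [4.5.11], the linear projection is given by

$$\hat{P}(Y_3 \mid Y_2, Y_1) = \Omega_{31}\Omega_{11}^{-1}Y_1 + h_{32}h_{22}^{-1}(Y_2 - \Omega_{21}\Omega_{11}^{-1}Y_1).$$   [4.5.12]

The MSE of the linear projection is the variance of $\tilde{Y}_3$, which from [4.5.5] is given
by $d_{33}$:

$$E[Y_3 - \hat{P}(Y_3 \mid Y_2, Y_1)]^2 = h_{33} - h_{32}h_{22}^{-1}h_{23}.$$   [4.5.13]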




Expression [4.5.12] gives a convenient formula for updating a linear
projection. Suppose we are interested in forecasting the value of $Y_3$. Let $Y_1$ be some
initial information on which this forecast might be based. A forecast of $Y_3$ on the
basis of $Y_1$ alone takes the form

$$\hat{P}(Y_3 \mid Y_1) = \Omega_{31}\Omega_{11}^{-1}Y_1.$$

Let $Y_2$ represent some new information with which we could update this forecast.
If we were asked to guess the magnitude of this second variable on the basis of $Y_1$
alone, the answer would be

$$\hat{P}(Y_2 \mid Y_1) = \Omega_{21}\Omega_{11}^{-1}Y_1.$$

Equation [4.5.12] states that

$$\hat{P}(Y_3 \mid Y_2, Y_1) = \hat{P}(Y_3 \mid Y_1) + h_{32}h_{22}^{-1}[Y_2 - \hat{P}(Y_2 \mid Y_1)].$$   [4.5.14]

We can thus optimally update the initial forecast $\hat{P}(Y_3 \mid Y_1)$ by adding to it a multiple
$(h_{32}h_{22}^{-1})$ of the unanticipated component of the new information $[Y_2 - \hat{P}(Y_2 \mid Y_1)]$.
This multiple $(h_{32}h_{22}^{-1})$ can also be interpreted as the coefficient on $Y_2$ in a linear
projection of $Y_3$ on $Y_2$ and $Y_1$.

To understand the nature of the multiplier $(h_{32}h_{22}^{-1})$, define the $(n \times 1)$ vector
$\tilde{\mathbf{Y}}(1)$ by

$$\tilde{\mathbf{Y}}(1) \equiv \mathbf{E}_1\mathbf{Y},$$   [4.5.15]

where $\mathbf{E}_1$ is the matrix given in [4.4.3]. Notice that the second-moment matrix of
$\tilde{\mathbf{Y}}(1)$ is given by

$$E\{\tilde{\mathbf{Y}}(1)[\tilde{\mathbf{Y}}(1)]'\} = E\{\mathbf{E}_1\mathbf{Y}\mathbf{Y}'\mathbf{E}_1'\} = \mathbf{E}_1\boldsymbol{\Omega}\mathbf{E}_1'.$$

But from [4.4.4] this is just the matrix $\mathbf{H}$. Thus $\mathbf{H}$ has the interpretation as the
second-moment matrix of $\tilde{\mathbf{Y}}(1)$. Substituting [4.4.3] into [4.5.15],

$$\tilde{\mathbf{Y}}(1) = \begin{bmatrix}
Y_1 \\
Y_2 - \Omega_{21}\Omega_{11}^{-1}Y_1 \\
Y_3 - \Omega_{31}\Omega_{11}^{-1}Y_1 \\
\vdots \\
Y_n - \Omega_{n1}\Omega_{11}^{-1}Y_1
\end{bmatrix}.$$

The first element of $\tilde{\mathbf{Y}}(1)$ is thus just $Y_1$ itself, while the $i$th element of $\tilde{\mathbf{Y}}(1)$ for
$i = 2, 3, \ldots, n$ is the residual from a projection of $Y_i$ on $Y_1$. The matrix $\mathbf{H}$ is
thus the second-moment matrix of the residuals from projections of each of the
variables on $Y_1$. In particular, $h_{22}$ is the MSE from a projection of $Y_2$ on $Y_1$:

$$h_{22} = E[Y_2 - \hat{P}(Y_2 \mid Y_1)]^2,$$

while $h_{32}$ is the expected product of this error with the error from a projection of
$Y_3$ on $Y_1$:

$$h_{32} = E\{[Y_3 - \hat{P}(Y_3 \mid Y_1)][Y_2 - \hat{P}(Y_2 \mid Y_1)]\}.$$

Thus equation [4.5.14] states that a linear projection can be updated using the
following formula:

$$\begin{aligned}
\hat{P}(Y_3 \mid Y_2, Y_1) = \hat{P}(Y_3 \mid Y_1)
&+ \{E[Y_3 - \hat{P}(Y_3 \mid Y_1)][Y_2 - \hat{P}(Y_2 \mid Y_1)]\} \\
&\times \{E[Y_2 - \hat{P}(Y_2 \mid Y_1)]^2\}^{-1} \times [Y_2 - \hat{P}(Y_2 \mid Y_1)].
\end{aligned}$$   [4.5.16]




For example, suppose that $Y_1$ is a constant term, so that $\hat{P}(Y_2 \mid Y_1)$ is just $\mu_2$, the
mean of $Y_2$, while $\hat{P}(Y_3 \mid Y_1) = \mu_3$. Equation [4.5.16] then states that

$$\hat{P}(Y_3 \mid Y_2, 1) = \mu_3 + \text{Cov}(Y_3, Y_2) \cdot [\text{Var}(Y_2)]^{-1} \cdot (Y_2 - \mu_2).$$

The MSE associated with this updated linear projection can also be calculated
from the triangular factorization. From [4.5.5], the MSE from a linear projection
of $Y_3$ on $Y_2$ and $Y_1$ can be calculated from

$$E[Y_3 - \hat{P}(Y_3 \mid Y_2, Y_1)]^2 = E(\tilde{Y}_3^2) = d_{33} = h_{33} - h_{32}h_{22}^{-1}h_{23}.$$

In general, for $i > 2$, the coefficient on $Y_2$ in a linear projection of $Y_i$ on $Y_2$
and $Y_1$ is given by the $i$th element of the second column of the matrix $\mathbf{A}$. For any
$i > j$, the coefficient on $Y_j$ in a linear projection of $Y_i$ on $Y_j, Y_{j-1}, \ldots, Y_1$ is
given by the row $i$, column $j$ element of $\mathbf{A}$. The magnitude $d_{ii}$ gives the MSE for
a linear projection of $Y_i$ on $Y_{i-1}, Y_{i-2}, \ldots, Y_1$.
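
The updating formula [4.5.14] can be verified numerically by comparing it with a direct projection of $Y_3$ on $(Y_1, Y_2)$. The second-moment matrix below is a hypothetical example, and the variable names are our own.

```python
import numpy as np

# Hypothetical second-moment matrix of (Y1, Y2, Y3).
omega = np.array([[4.0, 2.0, 2.0],
                  [2.0, 3.0, 1.5],
                  [2.0, 1.5, 3.0]])

# Direct projection of Y3 on (Y1, Y2): solve the normal equations.
beta = np.linalg.solve(omega[:2, :2], omega[:2, 2])
direct_mse = omega[2, 2] - beta @ omega[:2, 2]

# Updating route: project on Y1 first, then add the new information in Y2.
a21 = omega[1, 0] / omega[0, 0]            # P(Y2|Y1) = a21 * Y1
a31 = omega[2, 0] / omega[0, 0]            # P(Y3|Y1) = a31 * Y1
h22 = omega[1, 1] - omega[1, 0] * omega[0, 1] / omega[0, 0]
h32 = omega[2, 1] - omega[2, 0] * omega[0, 1] / omega[0, 0]
h33 = omega[2, 2] - omega[2, 0] * omega[0, 2] / omega[0, 0]
k = h32 / h22                              # multiplier on [Y2 - P(Y2|Y1)]

coef_y1 = a31 - k * a21                    # implied coefficient on Y1
coef_y2 = k                                # implied coefficient on Y2
update_mse = h33 - h32 * h32 / h22         # d_33
```

Both routes give coefficients $(0.375, 0.25)$ and an MSE of $1.875$ for this matrix.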


Application: Exact Finite-Sample Forecasts for an MA(1) 
Process 


As an example of applying these results, suppose that Y, follows an MA(1) 
process: 


Y,= p+ e+ 6&1, 


where e, is a white noise process with variance o? and 6 is unrestricted. Suppose 
we want to forecast the value of Y,, on the basis of the previous m — 1 values (Yj, 
Yo,. ++, Yy-4). Let 


Y= ((Y, IF #) (Y2 ~ #) sae (Yn-1 = #) (Y, i #)), 
and let 0 denote the (mn X nm) variance-covariance matrix of Y: 


1+6 6 O -- 0 
6 8148 6 «+. 9Q 

2 = E(YY') = 07] 0 Oo Ae eae Sh fa a7] 
0 0 0 --- 14+¢6 


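Appendix 4.B to this chapter shows that the triangular factorization of $\Omega$ is

$$A = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 & 0 \\ \dfrac{\theta}{1+\theta^2} & 1 & 0 & \cdots & 0 & 0 \\ 0 & \dfrac{\theta(1+\theta^2)}{1+\theta^2+\theta^4} & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & \dfrac{\theta[1+\theta^2+\theta^4+\cdots+\theta^{2(n-2)}]}{1+\theta^2+\theta^4+\cdots+\theta^{2(n-1)}} & 1 \end{bmatrix}$$ [4.5.18]

$$D = \sigma^2 \begin{bmatrix} 1+\theta^2 & 0 & 0 & \cdots & 0 \\ 0 & \dfrac{1+\theta^2+\theta^4}{1+\theta^2} & 0 & \cdots & 0 \\ 0 & 0 & \dfrac{1+\theta^2+\theta^4+\theta^6}{1+\theta^2+\theta^4} & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & 0 & \cdots & \dfrac{1+\theta^2+\theta^4+\cdots+\theta^{2n}}{1+\theta^2+\theta^4+\cdots+\theta^{2(n-1)}} \end{bmatrix}.$$ [4.5.19]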


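To use the triangular factorization to calculate exact finite-sample forecasts,
recall that $\tilde{Y}_i$, the $i$th element of $\tilde{\mathbf{Y}} = A^{-1}\mathbf{Y}$, has the interpretation as the residual
from a linear projection of $Y_i$ on a constant and its previous values:

$$\tilde{Y}_i = Y_i - \hat{E}(Y_i|Y_{i-1}, Y_{i-2}, \ldots, Y_1).$$

The system of equations $A\tilde{\mathbf{Y}} = \mathbf{Y}$ can be written out explicitly as

$$\tilde{Y}_1 = Y_1 - \mu$$
$$\frac{\theta}{1+\theta^2}\tilde{Y}_1 + \tilde{Y}_2 = Y_2 - \mu$$
$$\frac{\theta(1+\theta^2)}{1+\theta^2+\theta^4}\tilde{Y}_2 + \tilde{Y}_3 = Y_3 - \mu$$
$$\vdots$$
$$\frac{\theta[1+\theta^2+\theta^4+\cdots+\theta^{2(n-2)}]}{1+\theta^2+\theta^4+\cdots+\theta^{2(n-1)}}\tilde{Y}_{n-1} + \tilde{Y}_n = Y_n - \mu.$$

Solving the last equation for $\tilde{Y}_n$,

$$Y_n - \hat{E}(Y_n|Y_{n-1}, Y_{n-2}, \ldots, Y_1) = Y_n - \mu - \frac{\theta[1+\theta^2+\theta^4+\cdots+\theta^{2(n-2)}]}{1+\theta^2+\theta^4+\cdots+\theta^{2(n-1)}}[Y_{n-1} - \hat{E}(Y_{n-1}|Y_{n-2}, Y_{n-3}, \ldots, Y_1)],$$

implying

$$\hat{E}(Y_n|Y_{n-1}, Y_{n-2}, \ldots, Y_1) = \mu + \frac{\theta[1+\theta^2+\theta^4+\cdots+\theta^{2(n-2)}]}{1+\theta^2+\theta^4+\cdots+\theta^{2(n-1)}}[Y_{n-1} - \hat{E}(Y_{n-1}|Y_{n-2}, Y_{n-3}, \ldots, Y_1)].$$ [4.5.20]

The MSE of this forecast is given by $d_{nn}$:

$$\mathrm{MSE}[\hat{E}(Y_n|Y_{n-1}, Y_{n-2}, \ldots, Y_1)] = \sigma^2\,\frac{1+\theta^2+\theta^4+\cdots+\theta^{2n}}{1+\theta^2+\theta^4+\cdots+\theta^{2(n-1)}}.$$ [4.5.21]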
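The recursion in [4.5.20] lends itself to a short numerical check. The sketch below assumes NumPy; the function names and the sample values in `y_past` are hypothetical. It applies [4.5.20] one observation at a time and compares the result with a forecast obtained by solving the projection equations from $\Omega$ in [4.5.17] directly.

```python
import numpy as np

def geometric_sum(theta, k):
    """1 + theta^2 + theta^4 + ... + theta^(2k)."""
    return sum(theta ** (2 * j) for j in range(k + 1))

def ma1_exact_forecast(y_past, mu, theta, sigma2=1.0):
    """Forecast Y_n from y_past = (Y_1, ..., Y_{n-1}) with recursion [4.5.20]."""
    n = len(y_past) + 1
    forecast = mu                                  # projection of Y_1 on a constant
    for i in range(2, n + 1):
        resid = y_past[i - 2] - forecast           # Y_{i-1} minus its own forecast
        coef = theta * geometric_sum(theta, i - 2) / geometric_sum(theta, i - 1)
        forecast = mu + coef * resid               # forecast of Y_i
    mse = sigma2 * geometric_sum(theta, n) / geometric_sum(theta, n - 1)  # [4.5.21]
    return forecast, mse

def ma1_direct_forecast(y_past, mu, theta, sigma2=1.0):
    """Same forecast from the projection formula, using Omega in [4.5.17]."""
    n = len(y_past) + 1
    omega = sigma2 * ((1 + theta ** 2) * np.eye(n)
                      + theta * (np.eye(n, k=1) + np.eye(n, k=-1)))
    weights = np.linalg.solve(omega[:-1, :-1], omega[:-1, -1])
    forecast = mu + weights @ (np.asarray(y_past, dtype=float) - mu)
    mse = omega[-1, -1] - omega[-1, :-1] @ weights
    return forecast, mse

y_past = np.array([1.2, 0.4, -0.3, 0.9])   # hypothetical observations Y_1, ..., Y_4
print(ma1_exact_forecast(y_past, mu=0.5, theta=0.6))
print(ma1_direct_forecast(y_past, mu=0.5, theta=0.6))
```

The recursion needs only the previous residual at each step, whereas the direct approach inverts an $(n-1) \times (n-1)$ matrix; both should return the same forecast and MSE.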

It is interesting to note the behavior of this optimal forecast as the number
of observations ($n$) becomes large. First, suppose that the moving average
representation is invertible ($|\theta| < 1$). In this case, as $n \to \infty$, the coefficient in [4.5.20]
tends to $\theta$:

$$\frac{\theta[1+\theta^2+\theta^4+\cdots+\theta^{2(n-2)}]}{1+\theta^2+\theta^4+\cdots+\theta^{2(n-1)}} \to \theta,$$

while the MSE [4.5.21] tends to $\sigma^2$, the variance of the fundamental innovation.
Thus the optimal forecast for a finite number of observations [4.5.20] eventually
tends toward the forecast rule used for an infinite number of observations [4.2.32].




Alternatively, the calculations that produced [4.5.20] are equally valid for a
noninvertible representation with $|\theta| > 1$. In this case the coefficient in [4.5.20]
tends toward $\theta^{-1}$:

$$\frac{\theta[1+\theta^2+\theta^4+\cdots+\theta^{2(n-2)}]}{1+\theta^2+\theta^4+\cdots+\theta^{2(n-1)}} = \frac{\theta[1-\theta^{2(n-1)}]/(1-\theta^2)}{[1-\theta^{2n}]/(1-\theta^2)} = \frac{\theta(\theta^{-2n}-\theta^{-2})}{\theta^{-2n}-1} \to \frac{\theta(-\theta^{-2})}{-1} = \theta^{-1}.$$

Thus, the coefficient in [4.5.20] tends to $\theta^{-1}$ in this case, which is the moving
average coefficient associated with the invertible representation. The MSE [4.5.21]
tends to $\sigma^2\theta^2$:

$$\sigma^2\,\frac{[1-\theta^{2(n+1)}]/(1-\theta^2)}{[1-\theta^{2n}]/(1-\theta^2)} \to \sigma^2\theta^2,$$

which will be recognized from [3.7.7] as the variance of the innovation associated
with the fundamental representation.
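These two limits are easy to see numerically. The following sketch (plain Python; the function name `coefficient_and_mse` is hypothetical) evaluates the coefficient in [4.5.20] and the MSE in [4.5.21] for an invertible and a noninvertible value of $\theta$ as $n$ grows.

```python
def coefficient_and_mse(theta, n, sigma2=1.0):
    """Coefficient in [4.5.20] and MSE in [4.5.21] for a sample of size n."""
    def s(k):
        # 1 + theta^2 + ... + theta^(2k)
        return sum(theta ** (2 * j) for j in range(k + 1))
    return theta * s(n - 2) / s(n - 1), sigma2 * s(n) / s(n - 1)

for theta in (0.5, 2.0):
    for n in (5, 50, 200):
        coef, mse = coefficient_and_mse(theta, n)
        print(f"theta={theta}, n={n}: coefficient={coef:.6f}, MSE={mse:.6f}")
```

With $\theta = 0.5$ the coefficient and MSE settle at $\theta$ and $\sigma^2$; with $\theta = 2$ they settle at $\theta^{-1}$ and $\sigma^2\theta^2$, the parameters of the invertible representation.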

This observation explains the use of the expression "fundamental" in this
context. The fundamental innovation $\varepsilon_t$ has the property that

$$Y_t - \hat{E}(Y_t|Y_{t-1}, Y_{t-2}, \ldots, Y_{t-m}) \xrightarrow{m.s.} \varepsilon_t$$ [4.5.22]

as $m \to \infty$, where $\xrightarrow{m.s.}$ denotes mean square convergence. Thus when $|\theta| > 1$, the
coefficient $\theta$ in the approximation in [4.3.3] should be replaced by $\theta^{-1}$. When this
is done, expression [4.3.3] will approach the correct forecast as $m \to \infty$.

It is also instructive to consider the borderline case $\theta = 1$. The optimal finite-
sample forecast for an MA(1) process with $\theta = 1$ is seen from [4.5.20] to be given by

$$\hat{E}(Y_n|Y_{n-1}, Y_{n-2}, \ldots, Y_1) = \mu + \frac{n-1}{n}[Y_{n-1} - \hat{E}(Y_{n-1}|Y_{n-2}, Y_{n-3}, \ldots, Y_1)],$$

which, after recursive substitution, becomes

$$\hat{E}(Y_n|Y_{n-1}, Y_{n-2}, \ldots, Y_1) = \mu + \frac{n-1}{n}(Y_{n-1} - \mu) - \frac{n-2}{n}(Y_{n-2} - \mu) + \frac{n-3}{n}(Y_{n-3} - \mu) - \cdots + (-1)^n\frac{1}{n}(Y_1 - \mu).$$ [4.5.23]

The MSE of this forecast is given by [4.5.21]:

$$\sigma^2(n+1)/n \to \sigma^2.$$


Thus the variance of the forecast error again tends toward that of $\varepsilon_t$. Hence the
innovation $\varepsilon_t$ is again fundamental for this case in the sense of [4.5.22]. Note the
contrast between the optimal forecast [4.5.23] and a forecast based on a naive
application of [4.3.3],

$$\mu + (Y_{n-1} - \mu) - (Y_{n-2} - \mu) + (Y_{n-3} - \mu) - \cdots + (-1)^n(Y_1 - \mu).$$ [4.5.24]

The approximation [4.3.3] was derived under the assumption that the moving average
representation was invertible, and the borderline case $\theta = 1$ is not invertible. For this
reason [4.5.24] does not converge to the optimal forecast [4.5.23] as $n$ grows large.
When $\theta = 1$, $Y_t = \mu + \varepsilon_t + \varepsilon_{t-1}$ and [4.5.24] can be written as

$$\mu + (\varepsilon_{n-1} + \varepsilon_{n-2}) - (\varepsilon_{n-2} + \varepsilon_{n-3}) + (\varepsilon_{n-3} + \varepsilon_{n-4}) - \cdots + (-1)^n(\varepsilon_1 + \varepsilon_0) = \mu + \varepsilon_{n-1} + (-1)^n\varepsilon_0.$$

The difference between this and $Y_n$, the value being forecast, is $\varepsilon_n - (-1)^n\varepsilon_0$,
which has MSE $2\sigma^2$ for all $n$. Thus, whereas [4.5.23] converges to the optimal
forecast as $n \to \infty$, [4.5.24] does not.
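The contrast between [4.5.23] and [4.5.24] can be verified without simulation, because the MSE of any linear forecast follows from its weights and $\Omega$. A minimal sketch, assuming NumPy; the function name `forecast_error_variance` and the choice $n = 10$ are hypothetical.

```python
import numpy as np

def forecast_error_variance(weights, theta=1.0, sigma2=1.0):
    """MSE of mu + sum_k weights[k] * (Y_{n-1-k} - mu) as a forecast of Y_n.

    weights[0] multiplies (Y_{n-1} - mu), weights[1] multiplies (Y_{n-2} - mu), ...
    """
    n = len(weights) + 1
    omega = sigma2 * ((1 + theta ** 2) * np.eye(n)
                      + theta * (np.eye(n, k=1) + np.eye(n, k=-1)))
    # Order the vector as (Y_n, Y_{n-1}, ..., Y_1); omega is unchanged by this
    # reversal because it is symmetric with constant diagonals.
    c = np.concatenate(([1.0], -np.asarray(weights, dtype=float)))
    return c @ omega @ c

n = 10
optimal = [(-1) ** k * (n - 1 - k) / n for k in range(n - 1)]   # weights in [4.5.23]
naive = [(-1) ** k for k in range(n - 1)]                       # weights in [4.5.24]

print("optimal MSE:", forecast_error_variance(optimal))   # sigma^2 (n + 1) / n
print("naive MSE:  ", forecast_error_variance(naive))     # 2 sigma^2
```

The declining weights in [4.5.23] are what keep the initial shock $\varepsilon_0$ from contaminating the forecast; the naive rule carries it forward at full weight.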


Block Triangular Factorization 


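Suppose we have observations on two sets of variables. The first set of
variables is collected in an $(n_1 \times 1)$ vector $Y_1$ and the second set in an $(n_2 \times 1)$ vector
$Y_2$. Their second-moment matrix can be written in partitioned form as

$$\Omega \equiv \begin{bmatrix} E(Y_1Y_1') & E(Y_1Y_2') \\ E(Y_2Y_1') & E(Y_2Y_2') \end{bmatrix} = \begin{bmatrix} \Omega_{11} & \Omega_{12} \\ \Omega_{21} & \Omega_{22} \end{bmatrix},$$

where $\Omega_{11}$ is an $(n_1 \times n_1)$ matrix, $\Omega_{22}$ is an $(n_2 \times n_2)$ matrix, and the $(n_1 \times n_2)$
matrix $\Omega_{12}$ is the transpose of the $(n_2 \times n_1)$ matrix $\Omega_{21}$.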


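We can put zeros in the lower left $(n_2 \times n_1)$ block of $\Omega$ by premultiplying
$\Omega$ by the following matrix:

$$\bar{E}_1 = \begin{bmatrix} I_{n_1} & 0 \\ -\Omega_{21}\Omega_{11}^{-1} & I_{n_2} \end{bmatrix}.$$

If $\Omega$ is premultiplied by $\bar{E}_1$ and postmultiplied by $\bar{E}_1'$, the result is

$$\begin{bmatrix} I_{n_1} & 0 \\ -\Omega_{21}\Omega_{11}^{-1} & I_{n_2} \end{bmatrix} \begin{bmatrix} \Omega_{11} & \Omega_{12} \\ \Omega_{21} & \Omega_{22} \end{bmatrix} \begin{bmatrix} I_{n_1} & -\Omega_{11}^{-1}\Omega_{12} \\ 0 & I_{n_2} \end{bmatrix} = \begin{bmatrix} \Omega_{11} & 0 \\ 0 & \Omega_{22} - \Omega_{21}\Omega_{11}^{-1}\Omega_{12} \end{bmatrix}.$$ [4.5.25]

Define

$$\bar{A} \equiv \bar{E}_1^{-1} = \begin{bmatrix} I_{n_1} & 0 \\ \Omega_{21}\Omega_{11}^{-1} & I_{n_2} \end{bmatrix}.$$

If [4.5.25] is premultiplied by $\bar{A}$ and postmultiplied by $\bar{A}'$, the result is

$$\begin{bmatrix} \Omega_{11} & \Omega_{12} \\ \Omega_{21} & \Omega_{22} \end{bmatrix} = \begin{bmatrix} I_{n_1} & 0 \\ \Omega_{21}\Omega_{11}^{-1} & I_{n_2} \end{bmatrix} \times \begin{bmatrix} \Omega_{11} & 0 \\ 0 & \Omega_{22} - \Omega_{21}\Omega_{11}^{-1}\Omega_{12} \end{bmatrix} \times \begin{bmatrix} I_{n_1} & \Omega_{11}^{-1}\Omega_{12} \\ 0 & I_{n_2} \end{bmatrix} = \bar{A}\bar{D}\bar{A}'.$$ [4.5.26]

This is similar to the triangular factorization $\Omega = ADA'$, except that $\bar{D}$ is a block-
diagonal matrix rather than a truly diagonal matrix:

$$\bar{D} = \begin{bmatrix} \Omega_{11} & 0 \\ 0 & \Omega_{22} - \Omega_{21}\Omega_{11}^{-1}\Omega_{12} \end{bmatrix}.$$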



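As in the earlier case, $\bar{D}$ can be interpreted as the second-moment matrix of
the vector $\tilde{Y} = \bar{A}^{-1}Y$,

$$\begin{bmatrix} \tilde{Y}_1 \\ \tilde{Y}_2 \end{bmatrix} = \begin{bmatrix} I_{n_1} & 0 \\ -\Omega_{21}\Omega_{11}^{-1} & I_{n_2} \end{bmatrix} \begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix};$$

that is, $\tilde{Y}_1 = Y_1$ and $\tilde{Y}_2 = Y_2 - \Omega_{21}\Omega_{11}^{-1}Y_1$. The $i$th element of $\tilde{Y}_2$ is given by $Y_{2i}$
minus a linear combination of the elements of $Y_1$. The block-diagonality of
$\bar{D}$ implies that the product of any element of $\tilde{Y}_2$ with any element of $Y_1$ has
expectation zero. Thus $\Omega_{21}\Omega_{11}^{-1}$ gives the matrix of coefficients associated with the
linear projection of the vector $Y_2$ on the vector $Y_1$,

$$P(Y_2|Y_1) = \Omega_{21}\Omega_{11}^{-1}Y_1,$$ [4.5.27]

as claimed in [4.1.23]. The MSE matrix associated with this linear projection is

$$E\{[Y_2 - P(Y_2|Y_1)][Y_2 - P(Y_2|Y_1)]'\} = E(\tilde{Y}_2\tilde{Y}_2') = \bar{D}_{22} = \Omega_{22} - \Omega_{21}\Omega_{11}^{-1}\Omega_{12},$$ [4.5.28]

as claimed in [4.1.24].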
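The block factorization [4.5.26] and the orthogonality that underlies [4.5.27] can be illustrated as follows. This is a sketch under stated assumptions: NumPy is used, the block sizes and the randomly generated positive definite `omega` are hypothetical, and the names `A_bar` and `D_bar` simply mirror the matrices in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 2, 3
B = rng.standard_normal((n1 + n2, n1 + n2))
omega = B @ B.T + np.eye(n1 + n2)       # hypothetical second-moment matrix

o11, o12 = omega[:n1, :n1], omega[:n1, n1:]
o21, o22 = omega[n1:, :n1], omega[n1:, n1:]

coef = o21 @ np.linalg.inv(o11)         # Omega_21 Omega_11^{-1}, as in [4.5.27]
A_bar = np.block([[np.eye(n1), np.zeros((n1, n2))],
                  [coef, np.eye(n2)]])
D_bar = np.block([[o11, np.zeros((n1, n2))],
                  [np.zeros((n2, n1)), o22 - coef @ o12]])

# Cross moment between the projection error Y2 - coef @ Y1 and Y1.
cross_moment = o21 - coef @ o11

print(np.allclose(A_bar @ D_bar @ A_bar.T, omega))   # factorization [4.5.26] holds
print(np.allclose(cross_moment, 0.0))                # error is orthogonal to Y1
```

The lower right block of `D_bar` is the MSE matrix in [4.5.28].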

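The calculations for a $(3 \times 3)$ matrix similarly extend to a $(3 \times 3)$ block
matrix without complications. Let $Y_1$, $Y_2$, and $Y_3$ be $(n_1 \times 1)$, $(n_2 \times 1)$, and $(n_3 \times 1)$
vectors. A block-triangular factorization of their second-moment matrix is obtained
from a simple generalization of equation [4.4.13]:

$$\begin{bmatrix} \Omega_{11} & \Omega_{12} & \Omega_{13} \\ \Omega_{21} & \Omega_{22} & \Omega_{23} \\ \Omega_{31} & \Omega_{32} & \Omega_{33} \end{bmatrix} = \begin{bmatrix} I_{n_1} & 0 & 0 \\ \Omega_{21}\Omega_{11}^{-1} & I_{n_2} & 0 \\ \Omega_{31}\Omega_{11}^{-1} & H_{32}H_{22}^{-1} & I_{n_3} \end{bmatrix} \times \begin{bmatrix} \Omega_{11} & 0 & 0 \\ 0 & H_{22} & 0 \\ 0 & 0 & H_{33} - H_{32}H_{22}^{-1}H_{23} \end{bmatrix} \times \begin{bmatrix} I_{n_1} & \Omega_{11}^{-1}\Omega_{12} & \Omega_{11}^{-1}\Omega_{13} \\ 0 & I_{n_2} & H_{22}^{-1}H_{23} \\ 0 & 0 & I_{n_3} \end{bmatrix},$$ [4.5.29]

where $H_{22} = (\Omega_{22} - \Omega_{21}\Omega_{11}^{-1}\Omega_{12})$, $H_{33} = (\Omega_{33} - \Omega_{31}\Omega_{11}^{-1}\Omega_{13})$, and $H_{23} = H_{32}'
= (\Omega_{23} - \Omega_{21}\Omega_{11}^{-1}\Omega_{13})$.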

This allows us to generalize the earlier result [4.5.12] on updating a linear
projection. The optimal forecast of $Y_3$ conditional on $Y_2$ and $Y_1$ can be read off
the last block row of $\bar{A}$:

$$P(Y_3|Y_2,Y_1) = \Omega_{31}\Omega_{11}^{-1}Y_1 + H_{32}H_{22}^{-1}(Y_2 - \Omega_{21}\Omega_{11}^{-1}Y_1) = P(Y_3|Y_1) + H_{32}H_{22}^{-1}[Y_2 - P(Y_2|Y_1)],$$ [4.5.30]

where

$$H_{22} = E\{[Y_2 - P(Y_2|Y_1)][Y_2 - P(Y_2|Y_1)]'\}$$
$$H_{32} = E\{[Y_3 - P(Y_3|Y_1)][Y_2 - P(Y_2|Y_1)]'\}.$$

The MSE of this forecast is the matrix generalization of [4.5.13],

$$E\{[Y_3 - P(Y_3|Y_2,Y_1)][Y_3 - P(Y_3|Y_2,Y_1)]'\} = H_{33} - H_{32}H_{22}^{-1}H_{23},$$ [4.5.31]

where

$$H_{33} = E\{[Y_3 - P(Y_3|Y_1)][Y_3 - P(Y_3|Y_1)]'\}.$$
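The updating formula [4.5.30] and its MSE [4.5.31] can be compared with a projection of $Y_3$ on the stacked vector $(Y_1', Y_2')'$ computed in one step. A minimal sketch, assuming NumPy; the block sizes, the random `omega`, and the helper `h` are hypothetical illustration choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, n3 = 2, 2, 1
n = n1 + n2 + n3
B = rng.standard_normal((n, n))
omega = B @ B.T + np.eye(n)             # hypothetical second-moment matrix

i1, i2, i3 = slice(0, n1), slice(n1, n1 + n2), slice(n1 + n2, n)
i12 = slice(0, n1 + n2)
inv11 = np.linalg.inv(omega[i1, i1])

def h(a, b):
    """Cross moment of the errors from projecting blocks a and b on Y1."""
    return omega[a, b] - omega[a, i1] @ inv11 @ omega[i1, b]

gain = h(i3, i2) @ np.linalg.inv(h(i2, i2))            # H32 H22^{-1}
# Collect the terms of [4.5.30] multiplying Y1 and Y2.
coef_y1 = omega[i3, i1] @ inv11 - gain @ omega[i2, i1] @ inv11
coef_updated = np.hstack([coef_y1, gain])
mse_updated = h(i3, i3) - gain @ h(i2, i3)             # [4.5.31]

# One-step projection of Y3 on (Y1, Y2).
coef_direct = omega[i3, i12] @ np.linalg.inv(omega[i12, i12])
mse_direct = omega[i3, i3] - coef_direct @ omega[i12, i3]

print(np.allclose(coef_updated, coef_direct), np.allclose(mse_updated, mse_direct))
```

The updating form is useful when $P(Y_3|Y_1)$ and $P(Y_2|Y_1)$ are already in hand, since only $H_{22}$ needs to be inverted.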


Law of Iterated Projections 


Another useful result, the law of iterated projections, can be inferred
immediately from [4.5.30]. What happens if the projection $P(Y_3|Y_2,Y_1)$ is itself
projected on $Y_1$? The law of iterated projections says that this projection is equal
to the simple projection of $Y_3$ on $Y_1$:

$$P[P(Y_3|Y_2,Y_1)|Y_1] = P(Y_3|Y_1).$$ [4.5.32]

To verify this claim, we need to show that the difference between $P(Y_3|Y_2,Y_1)$
and $P(Y_3|Y_1)$ is uncorrelated with $Y_1$. But from [4.5.30], this difference is given
by

$$P(Y_3|Y_2,Y_1) - P(Y_3|Y_1) = H_{32}H_{22}^{-1}[Y_2 - P(Y_2|Y_1)],$$

which indeed is uncorrelated with $Y_1$ by the definition of the linear projection
$P(Y_2|Y_1)$.
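The law of iterated projections [4.5.32] can also be checked from second moments alone. In the sketch below (NumPy assumed; the random `omega` and block sizes are hypothetical), the projection $P(Y_3|Y_2,Y_1)$ is a linear function of $(Y_1', Y_2')'$, so its cross moment with $Y_1$ is available in closed form.

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2, n3 = 2, 2, 1
n = n1 + n2 + n3
B = rng.standard_normal((n, n))
omega = B @ B.T + np.eye(n)             # hypothetical second-moment matrix

i1 = slice(0, n1)
i12 = slice(0, n1 + n2)
i3 = slice(n1 + n2, n)

# P(Y3 | Y2, Y1) = coef_12 @ (Y1', Y2')'
coef_12 = omega[i3, i12] @ np.linalg.inv(omega[i12, i12])
# Cross moment of that projection with Y1, then project it on Y1.
cross = coef_12 @ omega[i12, i1]
iterated = cross @ np.linalg.inv(omega[i1, i1])
# Direct projection of Y3 on Y1.
direct = omega[i3, i1] @ np.linalg.inv(omega[i1, i1])

print(np.allclose(iterated, direct))
```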


4.6. Optimal Forecasts for Gaussian Processes 


The forecasting rules developed in this chapter are optimal within the class of linear 
functions of the variables on which the forecast is based. For Gaussian processes, 
we can make the stronger claim that as long as a constant term is included among 
the variables on which the forecast is based, the optimal unrestricted forecast turns 
out to have a linear form and thus is given by the linear projection. 

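To verify this, let $Y_1$ be an $(n_1 \times 1)$ vector with mean $\mu_1$, and $Y_2$ an $(n_2 \times 1)$
vector with mean $\mu_2$, where the variance-covariance matrix is given by

$$\begin{bmatrix} E(Y_1 - \mu_1)(Y_1 - \mu_1)' & E(Y_1 - \mu_1)(Y_2 - \mu_2)' \\ E(Y_2 - \mu_2)(Y_1 - \mu_1)' & E(Y_2 - \mu_2)(Y_2 - \mu_2)' \end{bmatrix} = \begin{bmatrix} \Omega_{11} & \Omega_{12} \\ \Omega_{21} & \Omega_{22} \end{bmatrix}.$$

If $Y_1$ and $Y_2$ are Gaussian, then the joint probability density is

$$f_{Y_1,Y_2}(y_1, y_2) = \frac{1}{(2\pi)^{(n_1+n_2)/2}} \begin{vmatrix} \Omega_{11} & \Omega_{12} \\ \Omega_{21} & \Omega_{22} \end{vmatrix}^{-1/2} \times \exp\left\{ -\frac{1}{2} \begin{bmatrix} (y_1 - \mu_1)' & (y_2 - \mu_2)' \end{bmatrix} \begin{bmatrix} \Omega_{11} & \Omega_{12} \\ \Omega_{21} & \Omega_{22} \end{bmatrix}^{-1} \begin{bmatrix} y_1 - \mu_1 \\ y_2 - \mu_2 \end{bmatrix} \right\}.$$ [4.6.1]

The inverse of $\Omega$ is readily found by inverting [4.5.26]:

$$\Omega^{-1} = [\bar{A}\bar{D}\bar{A}']^{-1} = [\bar{A}']^{-1}\bar{D}^{-1}\bar{A}^{-1} = \begin{bmatrix} I_{n_1} & -\Omega_{11}^{-1}\Omega_{12} \\ 0 & I_{n_2} \end{bmatrix} \begin{bmatrix} \Omega_{11}^{-1} & 0 \\ 0 & (\Omega_{22} - \Omega_{21}\Omega_{11}^{-1}\Omega_{12})^{-1} \end{bmatrix} \times \begin{bmatrix} I_{n_1} & 0 \\ -\Omega_{21}\Omega_{11}^{-1} & I_{n_2} \end{bmatrix}.$$ [4.6.2]

Likewise, the determinant of $\Omega$ can be found by taking the determinant of [4.5.26]:

$$|\Omega| = |\bar{A}| \cdot |\bar{D}| \cdot |\bar{A}'|.$$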




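But $\bar{A}$ is a lower triangular matrix. Its determinant is therefore given by the product
of terms along the principal diagonal, all of which are unity. Hence $|\bar{A}| = 1$ and
$|\Omega| = |\bar{D}|$:⁵

$$\begin{vmatrix} \Omega_{11} & \Omega_{12} \\ \Omega_{21} & \Omega_{22} \end{vmatrix} = \begin{vmatrix} \Omega_{11} & 0 \\ 0 & \Omega_{22} - \Omega_{21}\Omega_{11}^{-1}\Omega_{12} \end{vmatrix} = |\Omega_{11}| \cdot |\Omega_{22} - \Omega_{21}\Omega_{11}^{-1}\Omega_{12}|.$$ [4.6.3]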
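Substituting [4.6.2] and [4.6.3] into [4.6.1], the joint density can be written

$$f_{Y_1,Y_2}(y_1, y_2) = \frac{1}{(2\pi)^{(n_1+n_2)/2}} |\Omega_{11}|^{-1/2} \cdot |\Omega_{22} - \Omega_{21}\Omega_{11}^{-1}\Omega_{12}|^{-1/2}$$
$$\times \exp\left\{ -\frac{1}{2} \begin{bmatrix} (y_1 - \mu_1)' & (y_2 - \mu_2)' \end{bmatrix} \begin{bmatrix} I_{n_1} & -\Omega_{11}^{-1}\Omega_{12} \\ 0 & I_{n_2} \end{bmatrix} \begin{bmatrix} \Omega_{11}^{-1} & 0 \\ 0 & (\Omega_{22} - \Omega_{21}\Omega_{11}^{-1}\Omega_{12})^{-1} \end{bmatrix} \begin{bmatrix} I_{n_1} & 0 \\ -\Omega_{21}\Omega_{11}^{-1} & I_{n_2} \end{bmatrix} \begin{bmatrix} y_1 - \mu_1 \\ y_2 - \mu_2 \end{bmatrix} \right\}$$

$$= \frac{1}{(2\pi)^{(n_1+n_2)/2}} |\Omega_{11}|^{-1/2} \cdot |\Omega_{22} - \Omega_{21}\Omega_{11}^{-1}\Omega_{12}|^{-1/2}$$
$$\times \exp\left\{ -\frac{1}{2} \begin{bmatrix} (y_1 - \mu_1)' & (y_2 - m)' \end{bmatrix} \begin{bmatrix} \Omega_{11}^{-1} & 0 \\ 0 & (\Omega_{22} - \Omega_{21}\Omega_{11}^{-1}\Omega_{12})^{-1} \end{bmatrix} \begin{bmatrix} y_1 - \mu_1 \\ y_2 - m \end{bmatrix} \right\}$$ [4.6.4]

$$= \frac{1}{(2\pi)^{(n_1+n_2)/2}} |\Omega_{11}|^{-1/2} \cdot |\Omega_{22} - \Omega_{21}\Omega_{11}^{-1}\Omega_{12}|^{-1/2}$$
$$\times \exp\left\{ -\frac{1}{2}(y_1 - \mu_1)'\Omega_{11}^{-1}(y_1 - \mu_1) - \frac{1}{2}(y_2 - m)'(\Omega_{22} - \Omega_{21}\Omega_{11}^{-1}\Omega_{12})^{-1}(y_2 - m) \right\},$$

where

$$m \equiv \mu_2 + \Omega_{21}\Omega_{11}^{-1}(y_1 - \mu_1).$$ [4.6.5]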


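The conditional density of $Y_2$ given $Y_1$ is found by dividing the joint density
[4.6.4] by the marginal density:

$$f_{Y_1}(y_1) = \frac{1}{(2\pi)^{n_1/2}} |\Omega_{11}|^{-1/2} \exp\left\{ -\frac{1}{2}(y_1 - \mu_1)'\Omega_{11}^{-1}(y_1 - \mu_1) \right\}.$$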


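⁵Write $\Omega_{11}$ in Jordan form as $M_1J_1M_1^{-1}$, where $J_1$ is upper triangular with eigenvalues of $\Omega_{11}$ along
the principal diagonal. Write $\Omega_{22} - \Omega_{21}\Omega_{11}^{-1}\Omega_{12}$ as $M_2J_2M_2^{-1}$. Then $\bar{D} = MJM^{-1}$, where

$$M = \begin{bmatrix} M_1 & 0 \\ 0 & M_2 \end{bmatrix}, \qquad J = \begin{bmatrix} J_1 & 0 \\ 0 & J_2 \end{bmatrix}.$$

Thus $\bar{D}$ has the same determinant as $J$. Because $J$ is upper triangular, its determinant is the product
of terms along the principal diagonal, or $|J| = |J_1| \cdot |J_2|$. Hence $|\bar{D}| = |\Omega_{11}| \cdot |\Omega_{22} - \Omega_{21}\Omega_{11}^{-1}\Omega_{12}|$.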




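The result of this division is

$$f_{Y_2|Y_1}(y_2|y_1) = \frac{f_{Y_1,Y_2}(y_1, y_2)}{f_{Y_1}(y_1)} = \frac{1}{(2\pi)^{n_2/2}} |H|^{-1/2} \exp\left\{ -\frac{1}{2}(y_2 - m)'H^{-1}(y_2 - m) \right\},$$

where

$$H \equiv \Omega_{22} - \Omega_{21}\Omega_{11}^{-1}\Omega_{12}.$$ [4.6.6]

In other words,

$$Y_2|Y_1 \sim N(m, H) \sim N\left([\mu_2 + \Omega_{21}\Omega_{11}^{-1}(y_1 - \mu_1)],\ [\Omega_{22} - \Omega_{21}\Omega_{11}^{-1}\Omega_{12}]\right).$$ [4.6.7]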


We saw in Section 4.1 that the optimal unrestricted forecast is given by the
conditional expectation. For a Gaussian process, the optimal forecast is thus

$$E(Y_2|Y_1) = \mu_2 + \Omega_{21}\Omega_{11}^{-1}(y_1 - \mu_1).$$

On the other hand, for any distribution, the linear projection of the vector $Y_2$ on
a vector $Y_1$ and a constant term is given by

$$\hat{E}(Y_2|Y_1) = \mu_2 + \Omega_{21}\Omega_{11}^{-1}(y_1 - \mu_1).$$

Hence, for a Gaussian process, the linear projection gives the unrestricted optimal
forecast.
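The division of densities that leads to [4.6.7] can be reproduced numerically. The following is a minimal sketch, assuming NumPy; the helper `gaussian_logpdf`, the moments `mu` and `omega`, and the evaluation points `y1` and `y2` are hypothetical values picked only for illustration.

```python
import numpy as np

def gaussian_logpdf(y, mean, cov):
    """Log density of N(mean, cov) evaluated at y."""
    d = np.asarray(y, dtype=float) - np.asarray(mean, dtype=float)
    _, logdet = np.linalg.slogdet(cov)
    quad = d @ np.linalg.solve(cov, d)
    return -0.5 * (d.size * np.log(2 * np.pi) + logdet + quad)

# Hypothetical moments (n1 = 2, n2 = 1).
mu = np.array([1.0, -0.5, 2.0])
omega = np.array([[2.0, 0.3, 0.8],
                  [0.3, 1.0, -0.4],
                  [0.8, -0.4, 1.5]])
n1 = 2
mu1, mu2 = mu[:n1], mu[n1:]
o11, o12 = omega[:n1, :n1], omega[:n1, n1:]
o21, o22 = omega[n1:, :n1], omega[n1:, n1:]

y1 = np.array([0.2, 0.7])     # observed value of Y1
y2 = np.array([1.1])          # point at which the conditional density is evaluated

m = mu2 + o21 @ np.linalg.solve(o11, y1 - mu1)    # conditional mean, [4.6.5]
H = o22 - o21 @ np.linalg.solve(o11, o12)         # conditional variance, [4.6.6]

log_joint = gaussian_logpdf(np.concatenate([y1, y2]), mu, omega)
log_marginal = gaussian_logpdf(y1, mu1, o11)
log_conditional = gaussian_logpdf(y2, m, H)

print(np.isclose(log_joint - log_marginal, log_conditional))
```

Working in logs turns the division of the joint density by the marginal into a subtraction, which is also the numerically safer form.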


4.7. Sums of ARMA Processes 


This section explores the nature of series that result from adding two different 
ARMA processes together, beginning with an instructive example. 


Sum of an MA(1) Process Plus White Noise 
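Suppose that a series $X_t$ follows a zero-mean MA(1) process:

$$X_t = u_t + \delta u_{t-1},$$ [4.7.1]

where $u_t$ is white noise:

$$E(u_tu_{t-j}) = \begin{cases} \sigma_u^2 & \text{for } j = 0 \\ 0 & \text{otherwise.} \end{cases}$$

The autocovariances of $X_t$ are thus

$$E(X_tX_{t-j}) = \begin{cases} (1 + \delta^2)\sigma_u^2 & \text{for } j = 0 \\ \delta\sigma_u^2 & \text{for } j = \pm 1 \\ 0 & \text{otherwise.} \end{cases}$$ [4.7.2]

Let $v_t$ indicate a separate white noise series:

$$E(v_tv_{t-j}) = \begin{cases} \sigma_v^2 & \text{for } j = 0 \\ 0 & \text{otherwise.} \end{cases}$$ [4.7.3]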



Suppose, furthermore, that $v$ and $u$ are uncorrelated at all leads and lags:

$$E(u_tv_{t-j}) = 0 \quad \text{for all } j,$$

implying

$$E(X_tv_{t-j}) = 0 \quad \text{for all } j.$$ [4.7.4]

Let an observed series $Y_t$ represent the sum of the MA(1) and the white noise
process:

$$Y_t = X_t + v_t = u_t + \delta u_{t-1} + v_t.$$ [4.7.5]

The question now posed is, What are the time series properties of $Y$?
Clearly, $Y_t$ has mean zero, and its autocovariances can be deduced from [4.7.2]
through [4.7.4]:

$$E(Y_tY_{t-j}) = E(X_t + v_t)(X_{t-j} + v_{t-j}) = E(X_tX_{t-j}) + E(v_tv_{t-j}) = \begin{cases} (1 + \delta^2)\sigma_u^2 + \sigma_v^2 & \text{for } j = 0 \\ \delta\sigma_u^2 & \text{for } j = \pm 1 \\ 0 & \text{otherwise.} \end{cases}$$ [4.7.6]


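Thus, the sum $X_t + v_t$ is covariance-stationary, and its autocovariances are zero
beyond one lag, as are those for an MA(1). We might naturally then ask whether
there exists a zero-mean MA(1) representation for $Y$,

$$Y_t = \varepsilon_t + \theta\varepsilon_{t-1},$$ [4.7.7]

with

$$E(\varepsilon_t\varepsilon_{t-j}) = \begin{cases} \sigma^2 & \text{for } j = 0 \\ 0 & \text{otherwise,} \end{cases}$$

whose autocovariances match those implied by [4.7.6]. The autocovariances of
[4.7.7] would be given by

$$E(Y_tY_{t-j}) = \begin{cases} (1 + \theta^2)\sigma^2 & \text{for } j = 0 \\ \theta\sigma^2 & \text{for } j = \pm 1 \\ 0 & \text{otherwise.} \end{cases}$$

In order to be consistent with [4.7.6], it would have to be the case that

$$(1 + \theta^2)\sigma^2 = (1 + \delta^2)\sigma_u^2 + \sigma_v^2$$ [4.7.8]

and

$$\theta\sigma^2 = \delta\sigma_u^2.$$ [4.7.9]

Equation [4.7.9] can be solved for $\sigma^2$,

$$\sigma^2 = \delta\sigma_u^2/\theta,$$ [4.7.10]

and then substituted into [4.7.8] to deduce

$$(1 + \theta^2)(\delta\sigma_u^2/\theta) = (1 + \delta^2)\sigma_u^2 + \sigma_v^2$$
$$(1 + \theta^2)\delta = [(1 + \delta^2) + (\sigma_v^2/\sigma_u^2)]\theta$$
$$\delta\theta^2 - [(1 + \delta^2) + (\sigma_v^2/\sigma_u^2)]\theta + \delta = 0.$$ [4.7.11]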




For given values of $\delta$, $\sigma_u^2$, and $\sigma_v^2$, two values of $\theta$ that satisfy [4.7.11] can be found
from the quadratic formula:

$$\theta = \frac{[(1 + \delta^2) + (\sigma_v^2/\sigma_u^2)] \pm \sqrt{[(1 + \delta^2) + (\sigma_v^2/\sigma_u^2)]^2 - 4\delta^2}}{2\delta}.$$ [4.7.12]

If $\sigma_v^2$ were equal to zero, the quadratic equation in [4.7.11] would just be

$$\delta\theta^2 - (1 + \delta^2)\theta + \delta = \delta(\theta - \delta)(\theta - \delta^{-1}) = 0,$$ [4.7.13]

whose solutions are $\theta = \delta$ and $\theta = \delta^{-1}$, the moving average parameter for $X_t$ from
the invertible and noninvertible representations, respectively. Figure 4.1 graphs
equations [4.7.11] and [4.7.13] as functions of $\theta$ assuming positive autocorrelation
for $X_t$ ($\delta > 0$). For $\theta > 0$ and $\sigma_v^2 > 0$, equation [4.7.11] is everywhere lower than
[4.7.13] by the amount $(\sigma_v^2/\sigma_u^2)\theta$, implying that [4.7.11] has two real solutions for
$\theta$, an invertible solution $\theta^*$ satisfying

$$0 < |\theta^*| < |\delta|,$$ [4.7.14]

and a noninvertible solution $\tilde{\theta}^*$ characterized by

$$1 < |\delta^{-1}| < |\tilde{\theta}^*|.$$


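Taking the values associated with the invertible representation $(\theta^*, \sigma^{*2})$, let
us consider whether [4.7.7] could indeed characterize the data $\{Y_t\}$ generated by
[4.7.5]. This would require

$$(1 + \theta^*L)\varepsilon_t = (1 + \delta L)u_t + v_t$$ [4.7.15]

or

$$\varepsilon_t = (1 + \theta^*L)^{-1}[(1 + \delta L)u_t + v_t]$$
$$= (u_t - \theta^*u_{t-1} + \theta^{*2}u_{t-2} - \theta^{*3}u_{t-3} + \cdots)$$
$$\quad + \delta(u_{t-1} - \theta^*u_{t-2} + \theta^{*2}u_{t-3} - \theta^{*3}u_{t-4} + \cdots)$$
$$\quad + (v_t - \theta^*v_{t-1} + \theta^{*2}v_{t-2} - \theta^{*3}v_{t-3} + \cdots).$$ [4.7.16]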

FIGURE 4.1 Graphs of equations [4.7.13] and [4.7.11].

The series $\varepsilon_t$ defined in [4.7.16] is a distributed lag on past values of $u$ and $v$, so
it might seem to possess a rich autocorrelation structure. In fact, it turns out to be
white noise! To see this, note from [4.7.6] that the autocovariance-generating
function of $Y$ can be written

$$g_Y(z) = (1 + \delta z)\sigma_u^2(1 + \delta z^{-1}) + \sigma_v^2,$$ [4.7.17]

so that the autocovariance-generating function of $\varepsilon_t = (1 + \theta^*L)^{-1}Y_t$ is

$$g_\varepsilon(z) = \frac{(1 + \delta z)\sigma_u^2(1 + \delta z^{-1}) + \sigma_v^2}{(1 + \theta^*z)(1 + \theta^*z^{-1})}.$$ [4.7.18]

But $\theta^*$ and $\sigma^{*2}$ were chosen so as to make the autocovariance-generating function
of $(1 + \theta^*L)\varepsilon_t$, namely,

$$(1 + \theta^*z)\sigma^{*2}(1 + \theta^*z^{-1}),$$

identical to the right side of [4.7.17]. Thus, [4.7.18] is simply equal to

$$g_\varepsilon(z) = \sigma^{*2},$$

a white noise series.

To summarize, adding an MA(1) process to a white noise series with which 
it is uncorrelated at all leads and lags produces a new MA(1) process characterized 
by [4.7.7]. 

Note that the series $\varepsilon_t$ in [4.7.16] could not be forecast as a linear function
of lagged $\varepsilon$ or of lagged $Y$. Clearly, $\varepsilon$ could be forecast, however, on the basis of
lagged $u$ or lagged $v$. The histories $\{u_t\}$ and $\{v_t\}$ contain more information than $\{\varepsilon_t\}$
or $\{Y_t\}$. The optimal forecast of $Y_{t+1}$ on the basis of $\{Y_t, Y_{t-1}, \ldots\}$ would be

$$\hat{E}(Y_{t+1}|Y_t, Y_{t-1}, \ldots) = \theta^*\varepsilon_t$$

with associated mean squared error $\sigma^{*2}$. By contrast, the optimal linear forecast
of $Y_{t+1}$ on the basis of $\{u_t, u_{t-1}, \ldots, v_t, v_{t-1}, \ldots\}$ would be

$$\hat{E}(Y_{t+1}|u_t, u_{t-1}, \ldots, v_t, v_{t-1}, \ldots) = \delta u_t$$

with associated mean squared error $\sigma_u^2 + \sigma_v^2$. Recalling from [4.7.14] that $|\theta^*| <
|\delta|$, it appears from [4.7.9] that $(\theta^{*2})\sigma^{*2} < \delta^2\sigma_u^2$, meaning from [4.7.8] that $\sigma^2 >
\sigma_u^2 + \sigma_v^2$. In other words, past values of $Y$ contain less information than past values
of $u$ and $v$.
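The mapping from the structural parameters $(\delta, \sigma_u^2, \sigma_v^2)$ to the invertible MA(1) parameters $(\theta^*, \sigma^{*2})$ can be sketched in a few lines. The function name `ma1_plus_noise` and the numerical values are hypothetical; the code assumes $\delta \neq 0$ and simply applies [4.7.12] and [4.7.10].

```python
import math

def ma1_plus_noise(delta, sigma2_u, sigma2_v):
    """Invertible (theta*, sigma*^2) for Y_t = u_t + delta * u_{t-1} + v_t.

    Assumes delta != 0. The two roots of [4.7.11] multiply to one, so the
    root with the smaller absolute value is the invertible one.
    """
    b = (1 + delta ** 2) + sigma2_v / sigma2_u
    disc = math.sqrt(b ** 2 - 4 * delta ** 2)
    roots = [(b - disc) / (2 * delta), (b + disc) / (2 * delta)]   # [4.7.12]
    theta_star = min(roots, key=abs)
    sigma2_star = delta * sigma2_u / theta_star                    # [4.7.10]
    return theta_star, sigma2_star

delta, sigma2_u, sigma2_v = 0.5, 1.0, 0.5       # hypothetical structural values
theta_star, sigma2_star = ma1_plus_noise(delta, sigma2_u, sigma2_v)

print("theta* =", theta_star, " sigma*^2 =", sigma2_star)
print("variance matches [4.7.8]:",
      math.isclose((1 + theta_star ** 2) * sigma2_star,
                   (1 + delta ** 2) * sigma2_u + sigma2_v))
print("MSE from own lags exceeds sigma_u^2 + sigma_v^2:",
      sigma2_star > sigma2_u + sigma2_v)
```

The last line illustrates the information loss discussed above: forecasting from lagged $Y$ alone has a larger mean squared error than forecasting from lagged $u$ and $v$.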

This example can be useful for thinking about the consequences of differing
information sets. One can always make a sensible forecast on the basis of what
one knows, $\{Y_t, Y_{t-1}, \ldots\}$, though usually there is other information that could
have helped more. An important feature of such settings is that even though $\varepsilon_t$,
$u_t$, and $v_t$ are all white noise, there are complicated correlations between these
white noise series.

Another point worth noting is that all that can be estimated on the basis of
$\{Y_t\}$ are the two parameters $\theta^*$ and $\sigma^{*2}$, whereas the true "structural" model [4.7.5]
has three parameters ($\delta$, $\sigma_u^2$, and $\sigma_v^2$). Thus the parameters of the structural model
are unidentified in the sense in which econometricians use this term: there exists
a family of alternative configurations of $\delta$, $\sigma_u^2$, and $\sigma_v^2$ with $|\delta| < 1$ that would
produce the identical value for the likelihood function of the observed data $\{Y_t\}$.

The processes that were added together for this example both had mean zero.
Adding constant terms to the processes will not change the results in any interesting
way: if $X_t$ is an MA(1) process with mean $\mu_X$ and if $v_t$ is white noise plus a constant
$\mu_v$, then $X_t + v_t$ will be an MA(1) process with mean given by $\mu_X + \mu_v$. Thus,
nothing is lost by restricting the subsequent discussion to sums of zero-mean processes.




Adding Two Moving Average Processes 
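Suppose next that $X_t$ is a zero-mean MA($q_1$) process:

$$X_t = (1 + \delta_1L + \delta_2L^2 + \cdots + \delta_{q_1}L^{q_1})u_t \equiv \delta(L)u_t,$$

with

$$E(u_tu_{t-j}) = \begin{cases} \sigma_u^2 & \text{for } j = 0 \\ 0 & \text{otherwise.} \end{cases}$$

Let $W_t$ be a zero-mean MA($q_2$) process:

$$W_t = (1 + \kappa_1L + \kappa_2L^2 + \cdots + \kappa_{q_2}L^{q_2})v_t \equiv \kappa(L)v_t,$$

with

$$E(v_tv_{t-j}) = \begin{cases} \sigma_v^2 & \text{for } j = 0 \\ 0 & \text{otherwise.} \end{cases}$$

Thus, $X$ has autocovariances $\gamma_0^X, \gamma_1^X, \ldots, \gamma_{q_1}^X$ of the form of [3.3.12] while $W$ has
autocovariances $\gamma_0^W, \gamma_1^W, \ldots, \gamma_{q_2}^W$ of the same basic structure. Assume that $X$ and
$W$ are uncorrelated with each other at all leads and lags:

$$E(X_tW_{t-j}) = 0 \quad \text{for all } j;$$

and suppose we observe

$$Y_t = X_t + W_t.$$

Define $q$ to be the larger of $q_1$ or $q_2$:

$$q \equiv \max\{q_1, q_2\}.$$

Then the $j$th autocovariance of $Y$ is given by

$$E(Y_tY_{t-j}) = E(X_t + W_t)(X_{t-j} + W_{t-j}) = E(X_tX_{t-j}) + E(W_tW_{t-j}) = \begin{cases} \gamma_j^X + \gamma_j^W & \text{for } j = 0, \pm 1, \pm 2, \ldots, \pm q \\ 0 & \text{otherwise.} \end{cases}$$

Thus the autocovariances are zero beyond $q$ lags, suggesting that $Y_t$ might be
represented as an MA($q$) process.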

What more would we need to show to be fully convinced that $Y_t$ is indeed
an MA($q$) process? This question can be posed in terms of autocovariance-generating
functions. Since

$$\gamma_j^Y = \gamma_j^X + \gamma_j^W,$$

it follows that

$$\sum_{j=-\infty}^{\infty} \gamma_j^Y z^j = \sum_{j=-\infty}^{\infty} \gamma_j^X z^j + \sum_{j=-\infty}^{\infty} \gamma_j^W z^j.$$

But these are just the definitions of the respective autocovariance-generating
functions,

$$g_Y(z) = g_X(z) + g_W(z).$$ [4.7.19]

Equation [4.7.19] is a quite general result: if one adds together two covariance-
stationary processes that are uncorrelated with each other at all leads and lags, the
autocovariance-generating function of the sum is the sum of the autocovariance-
generating functions of the individual series.
If $Y_t$ is to be expressed as an MA($q$) process,

$$Y_t = (1 + \theta_1L + \theta_2L^2 + \cdots + \theta_qL^q)\varepsilon_t \equiv \theta(L)\varepsilon_t$$

with

$$E(\varepsilon_t\varepsilon_{t-j}) = \begin{cases} \sigma^2 & \text{for } j = 0 \\ 0 & \text{otherwise,} \end{cases}$$

then its autocovariance-generating function would be

$$g_Y(z) = \theta(z)\theta(z^{-1})\sigma^2.$$

The question is thus whether there always exist values of $(\theta_1, \theta_2, \ldots, \theta_q, \sigma^2)$ such
that [4.7.19] is satisfied:

$$\theta(z)\theta(z^{-1})\sigma^2 = \delta(z)\delta(z^{-1})\sigma_u^2 + \kappa(z)\kappa(z^{-1})\sigma_v^2.$$ [4.7.20]

It turns out that there do. Thus, the conjecture turns out to be correct that if two
moving average processes that are uncorrelated with each other at all leads and
lags are added together, the result is a new moving average process whose order
is the larger of the order of the original two series:

$$\text{MA}(q_1) + \text{MA}(q_2) = \text{MA}(\max\{q_1, q_2\}).$$ [4.7.21]


A proof of this assertion, along with a constructive algorithm for achieving the 
factorization in [4.7.20], will be provided in Chapter 13. 


Adding Two Autoregressive Processes 

Suppose now that $X_t$ and $W_t$ are two AR(1) processes:

$$(1 - \pi L)X_t = u_t$$ [4.7.22]
$$(1 - \rho L)W_t = v_t,$$ [4.7.23]

where $u_t$ and $v_t$ are each white noise with $u_t$ uncorrelated with $v_\tau$ for all $t$ and $\tau$.
Again suppose that we observe

$$Y_t = X_t + W_t$$

and want to forecast $Y_{t+1}$ on the basis of its own lagged values.
If, by chance, $X$ and $W$ share the same autoregressive parameter, or

$$\pi = \rho,$$

then [4.7.22] could simply be added directly to [4.7.23] to deduce

$$(1 - \pi L)X_t + (1 - \pi L)W_t = u_t + v_t$$

or

$$(1 - \pi L)(X_t + W_t) = u_t + v_t.$$

But the sum $u_t + v_t$ is white noise (as a special case of result [4.7.21]), meaning
that $Y_t$ has an AR(1) representation

$$(1 - \pi L)Y_t = \varepsilon_t.$$


In the more likely case that the autoregressive parameters $\pi$ and $\rho$ are
different, then [4.7.22] can be multiplied by $(1 - \rho L)$:

$$(1 - \rho L)(1 - \pi L)X_t = (1 - \rho L)u_t;$$ [4.7.24]

and similarly, [4.7.23] could be multiplied by $(1 - \pi L)$:

$$(1 - \pi L)(1 - \rho L)W_t = (1 - \pi L)v_t.$$ [4.7.25]

Adding [4.7.24] to [4.7.25] produces

$$(1 - \rho L)(1 - \pi L)(X_t + W_t) = (1 - \rho L)u_t + (1 - \pi L)v_t.$$ [4.7.26]

From [4.7.21], the right side of [4.7.26] has an MA(1) representation. Thus, we
could write

$$(1 - \phi_1L - \phi_2L^2)Y_t = (1 + \theta L)\varepsilon_t,$$

where

$$(1 - \phi_1L - \phi_2L^2) = (1 - \rho L)(1 - \pi L)$$

and

$$(1 + \theta L)\varepsilon_t = (1 - \rho L)u_t + (1 - \pi L)v_t.$$

In other words,

$$\text{AR}(1) + \text{AR}(1) = \text{ARMA}(2, 1).$$ [4.7.27]

In general, adding an AR($p_1$) process

$$\pi(L)X_t = u_t,$$

to an AR($p_2$) process with which it is uncorrelated at all leads and lags,

$$\rho(L)W_t = v_t,$$

produces an ARMA($p_1 + p_2$, $\max\{p_1, p_2\}$) process,

$$\phi(L)Y_t = \theta(L)\varepsilon_t,$$

where

$$\phi(L) = \pi(L)\rho(L)$$

and

$$\theta(L)\varepsilon_t = \rho(L)u_t + \pi(L)v_t.$$
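For the AR(1) + AR(1) case, the ARMA(2, 1) parameters can be computed directly from $\pi$, $\rho$, $\sigma_u^2$, and $\sigma_v^2$. The sketch below assumes NumPy; the function name `sum_of_ar1` and the numerical inputs are hypothetical. The moving average part is obtained by matching the variance and first autocovariance of the right side of [4.7.26] to those of an MA(1) and taking the invertible root, the same device used for [4.7.8] and [4.7.9].

```python
import numpy as np

def sum_of_ar1(pi, rho, sigma2_u, sigma2_v):
    """ARMA(2, 1) parameters (phi1, phi2, theta, sigma2) for Y = X + W."""
    # (1 - rho L)(1 - pi L) = 1 - phi1 L - phi2 L^2, coefficients in powers of L.
    ar_poly = np.convolve([1.0, -rho], [1.0, -pi])
    phi1, phi2 = -ar_poly[1], -ar_poly[2]

    # Autocovariances of (1 - rho L) u_t + (1 - pi L) v_t.
    gamma0 = (1 + rho ** 2) * sigma2_u + (1 + pi ** 2) * sigma2_v
    gamma1 = -rho * sigma2_u - pi * sigma2_v

    # Solve theta / (1 + theta^2) = gamma1 / gamma0 for the root with |theta| <= 1.
    r = gamma1 / gamma0
    if r == 0.0:
        theta = 0.0
    else:
        theta = (1 - np.sqrt(max(0.0, 1 - 4 * r ** 2))) / (2 * r)
    sigma2 = gamma0 / (1 + theta ** 2)
    return phi1, phi2, theta, sigma2

phi1, phi2, theta, sigma2 = sum_of_ar1(pi=0.8, rho=0.3, sigma2_u=1.0, sigma2_v=2.0)
print(phi1, phi2, theta, sigma2)
```

Here $\phi_1 = \pi + \rho$ and $\phi_2 = -\pi\rho$, while $\theta$ depends on the relative variances of the two innovations as well as on $\pi$ and $\rho$.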


4.8. Wold’s Decomposition and the Box-Jenkins 
Modeling Philosophy 


Wold’s Decomposition 


All of the covariance-stationary processes considered in Chapter 3 can be
written in the form

$$Y_t = \mu + \sum_{j=0}^{\infty} \psi_j\varepsilon_{t-j},$$ [4.8.1]

where $\varepsilon_t$ is the white noise error one would make in forecasting $Y_t$ as a linear
function of lagged $Y$ and where $\sum_{j=0}^{\infty} \psi_j^2 < \infty$ with $\psi_0 = 1$.

One might think that we were able to write all these processes in the form 
of [4.8.1] because the discussion was restricted to a convenient class of models. 
However, the following result establishes that the representation [4.8.1] is in fact 
fundamental for any covariance-stationary time series. 




Proposition 4.1: (Wold's decomposition). Any zero-mean covariance-stationary
process $Y_t$ can be represented in the form

$$Y_t = \sum_{j=0}^{\infty} \psi_j\varepsilon_{t-j} + \kappa_t,$$ [4.8.2]

where $\psi_0 = 1$ and $\sum_{j=0}^{\infty} \psi_j^2 < \infty$. The term $\varepsilon_t$ is white noise and represents the error
made in forecasting $Y_t$ on the basis of a linear function of lagged $Y$:

$$\varepsilon_t = Y_t - \hat{E}(Y_t|Y_{t-1}, Y_{t-2}, \ldots).$$ [4.8.3]

The value of $\kappa_t$ is uncorrelated with $\varepsilon_{t-j}$ for any $j$, though $\kappa_t$ can be predicted
arbitrarily well from a linear function of past values of $Y$:

$$\kappa_t = \hat{E}(\kappa_t|Y_{t-1}, Y_{t-2}, \ldots).$$

The term $\kappa_t$ is called the linearly deterministic component of $Y_t$, while $\sum_{j=0}^{\infty} \psi_j\varepsilon_{t-j}$
is called the linearly indeterministic component. If $\kappa_t \equiv 0$, then the process is called
purely linearly indeterministic.

This proposition was first proved by Wold (1938).⁶ The proposition relies on
stable second moments of $Y$ but makes no use of higher moments. It thus describes
only optimal linear forecasts of $Y$.

Finding the Wold representation in principle requires fitting an infinite number
of parameters $(\psi_1, \psi_2, \ldots)$ to the data. With a finite number of observations
on $(Y_1, Y_2, \ldots, Y_T)$, this will never be possible. As a practical matter, we therefore
need to make some additional assumptions about the nature of $(\psi_1, \psi_2, \ldots)$. A
typical assumption in Chapter 3 was that $\psi(L)$ can be expressed as the ratio of two
finite-order polynomials:

$$\sum_{j=0}^{\infty} \psi_jL^j = \frac{\theta(L)}{\phi(L)} \equiv \frac{1 + \theta_1L + \theta_2L^2 + \cdots + \theta_qL^q}{1 - \phi_1L - \phi_2L^2 - \cdots - \phi_pL^p}.$$ [4.8.4]

Another approach, based on the presumed "smoothness" of the population
spectrum, will be explored in Chapter 6.
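The ratio in [4.8.4] pins down every $\psi_j$ once the finite set of ARMA coefficients is given. A minimal sketch of the expansion, using the hypothetical function name `psi_weights`: equating coefficients on $L^j$ in $\phi(L)\psi(L) = \theta(L)$ gives $\psi_j = \theta_j + \phi_1\psi_{j-1} + \cdots + \phi_p\psi_{j-p}$, with $\theta_j = 0$ for $j > q$ and $\psi_j = 0$ for $j < 0$.

```python
def psi_weights(phi, theta, n_terms):
    """First n_terms coefficients of psi(L) = theta(L) / phi(L).

    phi = [phi_1, ..., phi_p] and theta = [theta_1, ..., theta_q] follow the
    sign convention of [4.8.4]; psi_0 = 1.
    """
    psi = []
    for j in range(n_terms):
        if j == 0:
            value = 1.0
        else:
            value = theta[j - 1] if j <= len(theta) else 0.0
            value += sum(phi[i - 1] * psi[j - i]
                         for i in range(1, min(j, len(phi)) + 1))
        psi.append(value)
    return psi

print(psi_weights([0.5], [0.3], 4))        # ARMA(1, 1)
print(psi_weights([0.5], [], 4))           # AR(1): geometric decay
print(psi_weights([], [0.4, 0.2], 4))      # MA(2): zero beyond lag 2
```

Three parameters in the ARMA(1, 1) case thus stand in for the infinite sequence of Wold coefficients.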


The Box-Jenkins Modeling Philosophy 


Many forecasters are persuaded of the benefits of parsimony, or using as few 
parameters as possible. Box and Jenkins (1976) have been influential advocates of 
this view. They noted that in practice, analysts end up replacing the true operators
$\theta(L)$ and $\phi(L)$ with estimates $\hat{\theta}(L)$ and $\hat{\phi}(L)$ based on the data. The more
parameters to estimate, the more room there is to go wrong.

Although complicated models can track the data very well over the historical 
period for which parameters are estimated, they often perform poorly when used for 
out-of-sample forecasting. For example, the 1960s saw the development of a number
of large macroeconometric models purporting to describe the economy using hundreds
of macroeconomic variables and equations. Part of the disillusionment with such efforts
was the discovery that univariate ARMA models with small values of $p$ or $q$ often
produced better forecasts than the big models (see for example Nelson, 1972).⁷ As
we shall see in later chapters, large size alone was hardly the only liability of these 
large-scale macroeconometric models. Even so, the claim that simpler models provide 
more robust forecasts has a great many believers across disciplines. 


⁶See Sargent (1987, pp. 286-90) for a nice sketch of the intuition behind this result.
⁷For more recent pessimistic evidence about current large-scale models, see Ashley (1988).




The approach to forecasting advocated by Box and Jenkins can be broken 
down into four steps: 


(1) Transform the data, if necessary, so that the assumption of covariance- 
stationarity is a reasonable one. 


(2) Make an initial guess of small values for p and g for an ARMA(p, q) model 
that might describe the transformed series. 


(3) Estimate the parameters in $\phi(L)$ and $\theta(L)$.


(4) Perform diagnostic analysis to confirm that the model is indeed consistent 
with the observed features of the data. 


The first step, selecting a suitable transformation of the data, is discussed in
Chapter 15. For now we merely remark that for economic series that grow over
time, many researchers use the change in the natural logarithm of the raw data.
For example, if $X_t$ is the level of real GNP in year $t$, then

$$Y_t = \log X_t - \log X_{t-1}$$ [4.8.5]

might be the variable that an ARMA model purports to describe.

The third and fourth steps, estimation and diagnostic testing, will be discussed
in Chapters 5 and 14. Analysis of seasonal dynamics can also be an important part
of step 2 of the procedure; this is briefly discussed in Section 6.4. The remainder
of this section is devoted to an exposition of the second step in the Box-Jenkins
procedure on nonseasonal data, namely, selecting candidate values for $p$ and $q$.⁸


Sample Autocorrelations 
An important part of this selection procedure is to form an estimate $\hat{\rho}_j$ of the
population autocorrelation $\rho_j$. Recall that $\rho_j$ was defined as

$$\rho_j \equiv \gamma_j/\gamma_0$$

where

$$\gamma_j = E(Y_t - \mu)(Y_{t-j} - \mu).$$

A natural estimate of the population autocorrelation $\rho_j$ is provided by the
corresponding sample moments:

$$\hat{\rho}_j \equiv \hat{\gamma}_j/\hat{\gamma}_0$$

where

$$\hat{\gamma}_j = \frac{1}{T}\sum_{t=j+1}^{T}(y_t - \bar{y})(y_{t-j} - \bar{y}) \quad \text{for } j = 0, 1, 2, \ldots, T - 1$$ [4.8.6]

$$\bar{y} = \frac{1}{T}\sum_{t=1}^{T} y_t.$$ [4.8.7]

Note that even though only $T - j$ observations are used to construct $\hat{\gamma}_j$, the
denominator in [4.8.6] is $T$ rather than $T - j$. Thus, for large $j$, expression [4.8.6]
shrinks the estimates toward zero, as indeed the population autocovariances go to
zero as $j \to \infty$, assuming covariance-stationarity. Also, the full sample of
observations is used to construct $\bar{y}$.
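The estimator in [4.8.6] and [4.8.7] translates into a few lines of code. A minimal sketch, assuming NumPy; the function name `sample_autocorrelations` and the toy series are hypothetical. Note the division by $T$ rather than $T - j$, and the use of the full-sample mean.

```python
import numpy as np

def sample_autocorrelations(y, max_lag):
    """Sample autocorrelations rho_hat_0, ..., rho_hat_max_lag from [4.8.6]-[4.8.7]."""
    y = np.asarray(y, dtype=float)
    T = y.size
    dev = y - y.mean()                      # deviations from the full-sample mean
    # dev[j:] holds y_t for t = j+1, ..., T; dev[:T - j] holds the matching y_{t-j}.
    gamma = np.array([dev[j:] @ dev[:T - j] / T for j in range(max_lag + 1)])
    return gamma / gamma[0]

y = [1.0, 2.0, 3.0, 4.0, 5.0]               # toy series, for illustration only
print(sample_autocorrelations(y, 2))        # approximately [1.0, 0.4, -0.1]
```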


⁸Box and Jenkins refer to this step as "identification" of the appropriate model. We avoid Box and
Jenkins's terminology, because "identification" has a quite different meaning for econometricians.




Recall that if the data really follow an MA($q$) process, then $\rho_j$ will be zero
for $j > q$. By contrast, if the data follow an AR($p$) process, then $\rho_j$ will gradually
decay toward zero as a mixture of exponentials or damped sinusoids. One guide
for distinguishing between MA and AR representations, then, would be the decay
properties of $\rho_j$. Often, we are interested in a quick assessment of whether $\rho_j = 0$
for $j = q + 1, q + 2, \ldots$. If the data were really generated by a Gaussian MA($q$)
process, then the variance of the estimate $\hat{\rho}_j$ could be approximated by⁹

$$\mathrm{Var}(\hat{\rho}_j) \cong \frac{1}{T}\left\{1 + 2\sum_{i=1}^{q}\rho_i^2\right\} \quad \text{for } j = q + 1, q + 2, \ldots.$$ [4.8.8]

Thus, in particular, if we suspect that the data were generated by Gaussian white
noise, then $\hat{\rho}_j$ for any $j \neq 0$ should lie between $\pm 2/\sqrt{T}$ about 95% of the time.
In general, if there is autocorrelation in the process that generated the original
data $\{Y_t\}$, then the estimate $\hat{\rho}_j$ will be correlated with $\hat{\rho}_i$ for $i \neq j$.¹⁰ Thus patterns
in the estimated $\hat{\rho}_j$ may represent sampling error rather than patterns in the true $\rho_j$.


Partial Autocorrelation 


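Another useful measure is the partial autocorrelation. The $m$th population
partial autocorrelation (denoted $\alpha_m^{(m)}$) is defined as the last coefficient in a linear
projection of $Y$ on its $m$ most recent values (equation [4.3.7]):

$$\hat{Y}_{t+1|t} - \mu = \alpha_1^{(m)}(Y_t - \mu) + \alpha_2^{(m)}(Y_{t-1} - \mu) + \cdots + \alpha_m^{(m)}(Y_{t-m+1} - \mu).$$

We saw in equation [4.3.8] that the vector $\alpha^{(m)}$ can be calculated from

$$\begin{bmatrix} \alpha_1^{(m)} \\ \alpha_2^{(m)} \\ \vdots \\ \alpha_m^{(m)} \end{bmatrix} = \begin{bmatrix} \gamma_0 & \gamma_1 & \cdots & \gamma_{m-1} \\ \gamma_1 & \gamma_0 & \cdots & \gamma_{m-2} \\ \vdots & \vdots & & \vdots \\ \gamma_{m-1} & \gamma_{m-2} & \cdots & \gamma_0 \end{bmatrix}^{-1} \begin{bmatrix} \gamma_1 \\ \gamma_2 \\ \vdots \\ \gamma_m \end{bmatrix}.$$

Recall that if the data were really generated by an AR($p$) process, only the $p$ most
recent values of $Y$ would be useful for forecasting. In this case, the projection
coefficients on $Y$'s more than $p$ periods in the past are equal to zero:

$$\alpha_m^{(m)} = 0 \quad \text{for } m = p + 1, p + 2, \ldots.$$

By contrast, if the data really were generated by an MA($q$) process with $q \geq 1$,
then the partial autocorrelation $\alpha_m^{(m)}$ asymptotically approaches zero instead of
cutting off abruptly.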

A natural estimate of the $m$th partial autocorrelation is the last coefficient in
an OLS regression of $y$ on a constant and its $m$ most recent values:

$$y_{t+1} = \hat{c} + \hat{\alpha}_1^{(m)}y_t + \hat{\alpha}_2^{(m)}y_{t-1} + \cdots + \hat{\alpha}_m^{(m)}y_{t-m+1} + \hat{e}_t,$$

where $\hat{e}_t$ denotes the OLS regression residual. If the data were really generated by
an AR($p$) process, then the sample estimate ($\hat{\alpha}_m^{(m)}$) would have a variance around
the true value (0) that could be approximated by¹¹

$$\mathrm{Var}(\hat{\alpha}_m^{(m)}) \cong 1/T \quad \text{for } m = p + 1, p + 2, \ldots.$$
⁹See Box and Jenkins (1976, p. 35).
¹⁰Again, see Box and Jenkins (1976, p. 35).
¹¹Box and Jenkins (1976, p. 65).




Moreover, if the data were really generated by an AR($p$) process, then $\hat{\alpha}_i^{(i)}$ and
$\hat{\alpha}_j^{(j)}$ would be asymptotically independent for $i, j > p$.
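The OLS estimate of the $m$th partial autocorrelation can be sketched as follows. NumPy is assumed; the function name `sample_partial_autocorrelation` and the simulated AR(1) series (with $\phi = 0.6$ and $T = 5000$, both arbitrary) are hypothetical and serve only to show the cutoff after lag $p = 1$.

```python
import numpy as np

def sample_partial_autocorrelation(y, m):
    """Last slope coefficient in an OLS regression of y_{t+1} on a constant
    and its m most recent values y_t, ..., y_{t-m+1}."""
    y = np.asarray(y, dtype=float)
    T = y.size
    target = y[m:]                                        # dependent variable
    lags = [y[m - k: T - k] for k in range(1, m + 1)]     # k-th lag of the target
    X = np.column_stack([np.ones(T - m)] + lags)
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    return beta[-1]

# Simulated AR(1) data, for illustration only.
rng = np.random.default_rng(42)
phi, T = 0.6, 5000
eps = rng.standard_normal(T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = phi * y[t - 1] + eps[t]

for m in (1, 2, 3):
    print(m, sample_partial_autocorrelation(y, m))
print("approximate 95% band: +/-", 2 / np.sqrt(T))
```

For an AR(1), the first estimate should be near $\phi$ and the later ones should mostly fall inside the $\pm 2/\sqrt{T}$ band implied by the variance approximation above.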


Example 4.1 

We illustrate the Box-Jenkins approach with seasonally adjusted quarterly data
on U.S. real GNP from 1947 through 1988. The raw data ($x_t$) were converted
to log changes ($y_t$) as in [4.8.5]. Panel (a) of Figure 4.2 plots the sample
autocorrelations of $y$ ($\hat{\rho}_j$ for $j = 0, 1, \ldots, 20$), while panel (b) displays the
sample partial autocorrelations ($\hat{\alpha}_m^{(m)}$ for $m = 0, 1, \ldots, 20$). Ninety-five
percent confidence bands ($\pm 2/\sqrt{T}$) are plotted on both panels; for panel (a),
these are appropriate under the null hypothesis that the data are really white
noise, whereas for panel (b) these are appropriate if the data are really
generated by an AR($p$) process for $p$ less than $m$.


[Figure 4.2: panel (a) Sample autocorrelations, plotted against lag ($j$); panel (b) Sample partial autocorrelations, plotted against lag ($m$). Vertical axes run from -1.2 to 1.2; lags run from 0 to 20.]

FIGURE 4.2 Sample autocorrelations and partial autocorrelations for U.S. quarterly
real GNP growth, 1947:II to 1988:IV. Ninety-five percent confidence intervals
are plotted as $\pm 2/\sqrt{T}$.




The first two autocorrelations appear nonzero, suggesting that q = 2
would be needed to describe these data as coming from a moving average
process. On the other hand, the pattern of autocorrelations appears consistent
with the simple geometric decay of an AR(1) process,

$$\rho_j = \phi^j,$$

with $\phi \cong 0.4$. The partial autocorrelation could also be viewed as dying out
after one lag, also consistent with the AR(1) hypothesis. Thus, one's initial
guess for a parsimonious model might be that GNP growth follows an AR(1)
process, with MA(2) as another possibility to be considered.
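
The OLS-based estimate of the partial autocorrelations described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sample_pacf_ols` is hypothetical, NumPy is assumed to be available, and the data are simulated from an AR(1) with φ = 0.4 rather than taken from the GNP series used in Example 4.1.

```python
import numpy as np

def sample_pacf_ols(y, max_lag):
    """For m = 1, ..., max_lag, return the last coefficient of an OLS regression
    of y on a constant and its m most recent values (hypothetical helper)."""
    y = np.asarray(y, dtype=float)
    T = y.size
    pacf = []
    for m in range(1, max_lag + 1):
        # Each row holds [1, y[t-1], ..., y[t-m]] for the target y[t], t = m, ..., T-1.
        X = np.column_stack([np.ones(T - m)] +
                            [y[m - k:T - k] for k in range(1, m + 1)])
        coef, *_ = np.linalg.lstsq(X, y[m:], rcond=None)
        pacf.append(coef[-1])
    return np.array(pacf)

# Simulated AR(1) data (an assumption for illustration, not the GNP data).
rng = np.random.default_rng(0)
T, phi = 2000, 0.4
eps = rng.standard_normal(T)
y = np.empty(T)
y[0] = eps[0]
for t in range(1, T):
    y[t] = phi * y[t - 1] + eps[t]

pacf = sample_pacf_ols(y, max_lag=5)
band = 2.0 / np.sqrt(T)   # the +/- 2/sqrt(T) band used in Figure 4.2
print(pacf, band)
```

For data of this kind one would expect the first partial autocorrelation to be near φ and the later ones to be small relative to the band, which is the pattern the example interprets as evidence for an AR(1).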


APPENDIX 4.A. Parallel Between OLS Regression 
and Linear Projection 


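This appendix discusses the parallel between ordinary least squares regression and linear
projection. This parallel is developed by introducing an artificial random variable specifically
constructed so as to have population moments identical to the sample moments of a particular
sample. Say that in some particular sample on which we intend to perform OLS we have
observed T particular values for the explanatory vector, denoted $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T$. Consider
an artificial discrete-valued random variable $\boldsymbol{\xi}$ that can take on only one of these particular
T values, each with probability (1/T):

$$\begin{aligned}
P\{\boldsymbol{\xi} = \mathbf{x}_1\} &= 1/T \\
P\{\boldsymbol{\xi} = \mathbf{x}_2\} &= 1/T \\
&\;\;\vdots \\
P\{\boldsymbol{\xi} = \mathbf{x}_T\} &= 1/T.
\end{aligned}$$

Thus $\boldsymbol{\xi}$ is an artificially constructed random variable whose population probability
distribution is given by the empirical distribution function of $\mathbf{x}_t$. The population mean of the
random variable $\boldsymbol{\xi}$ is

$$E(\boldsymbol{\xi}) = \sum_{t=1}^{T} \mathbf{x}_t \cdot P\{\boldsymbol{\xi} = \mathbf{x}_t\} = \frac{1}{T} \sum_{t=1}^{T} \mathbf{x}_t.$$

Thus, the population mean of $\boldsymbol{\xi}$ equals the observed sample mean of the true random variable
$\mathbf{X}_t$. The population second moment of $\boldsymbol{\xi}$ is

$$E(\boldsymbol{\xi}\boldsymbol{\xi}') = \frac{1}{T} \sum_{t=1}^{T} \mathbf{x}_t \mathbf{x}_t', \qquad [4.A.1]$$

which is the sample second moment of $(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T)$.

We can similarly construct a second artificial variable $\omega$ that can take on one of the
discrete values $(y_2, y_3, \ldots, y_{T+1})$. Suppose that the joint distribution of $\omega$ and $\boldsymbol{\xi}$ is given
by

$$P\{\boldsymbol{\xi} = \mathbf{x}_t, \omega = y_{t+1}\} = 1/T \quad \text{for } t = 1, 2, \ldots, T.$$

Then

$$E(\boldsymbol{\xi}\omega) = \frac{1}{T} \sum_{t=1}^{T} \mathbf{x}_t y_{t+1}. \qquad [4.A.2]$$

The coefficient for a linear projection of $\omega$ on $\boldsymbol{\xi}$ is the value of $\boldsymbol{\alpha}$ that minimizes

$$E(\omega - \boldsymbol{\alpha}'\boldsymbol{\xi})^2 = \frac{1}{T} \sum_{t=1}^{T} (y_{t+1} - \boldsymbol{\alpha}'\mathbf{x}_t)^2. \qquad [4.A.3]$$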


This is algebraically the same problem as choosing $\boldsymbol{\beta}$ so as to minimize [4.1.17]. Thus,
ordinary least squares regression (choosing $\boldsymbol{\beta}$ so as to minimize [4.1.17]) can be viewed as
a special case of linear projection (choosing $\boldsymbol{\alpha}$ so as to minimize [4.A.3]). The value of $\boldsymbol{\alpha}$
that minimizes [4.A.3] can be found from substituting the expressions for the population
moments of the artificial random variables (equations [4.A.1] and [4.A.2]) into the formula
for a linear projection (equation [4.1.13]):

$$\boldsymbol{\alpha} = [E(\boldsymbol{\xi}\boldsymbol{\xi}')]^{-1} E(\boldsymbol{\xi}\omega) = \left[\frac{1}{T} \sum_{t=1}^{T} \mathbf{x}_t \mathbf{x}_t'\right]^{-1} \left[\frac{1}{T} \sum_{t=1}^{T} \mathbf{x}_t y_{t+1}\right].$$

Thus the formula for the OLS estimate $\mathbf{b}$ in [4.1.18] can be obtained as a special case of
the formula for the linear projection coefficient $\boldsymbol{\alpha}$ in [4.1.13].

Because linear projections and OLS regressions share the same mathematical structure,
statements about one have a parallel in the other. This can be a useful device for
remembering results or confirming algebra. For example, the statement about population
moments,

$$E(Y^2) = \operatorname{Var}(Y) + [E(Y)]^2, \qquad [4.A.4]$$

has the sample analog

$$\frac{1}{T} \sum_{t=1}^{T} y_t^2 = \frac{1}{T} \sum_{t=1}^{T} (y_t - \bar{y})^2 + (\bar{y})^2, \qquad [4.A.5]$$

with $\bar{y} = (1/T) \sum_{t=1}^{T} y_t$.

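As a second example, suppose that we estimate a series of n OLS regressions, with
$y_{it}$ the dependent variable for the ith regression and $\mathbf{x}_t$ a (k × 1) vector of explanatory
variables common to each regression. Let $\mathbf{y}_t = (y_{1t}, y_{2t}, \ldots, y_{nt})'$ and write the regression
model as

$$\mathbf{y}_t = \boldsymbol{\Pi}'\mathbf{x}_t + \mathbf{u}_t$$

for $\boldsymbol{\Pi}'$ an (n × k) matrix of regression coefficients. Then the sample variance-covariance
matrix of the OLS residuals can be inferred from [4.1.24]:

$$\frac{1}{T} \sum_{t=1}^{T} \hat{\mathbf{u}}_t \hat{\mathbf{u}}_t' = \left[\frac{1}{T} \sum_{t=1}^{T} \mathbf{y}_t \mathbf{y}_t'\right] - \left[\frac{1}{T} \sum_{t=1}^{T} \mathbf{y}_t \mathbf{x}_t'\right] \left[\frac{1}{T} \sum_{t=1}^{T} \mathbf{x}_t \mathbf{x}_t'\right]^{-1} \left[\frac{1}{T} \sum_{t=1}^{T} \mathbf{x}_t \mathbf{y}_t'\right], \qquad [4.A.6]$$

where $\hat{\mathbf{u}}_t = \mathbf{y}_t - \hat{\boldsymbol{\Pi}}'\mathbf{x}_t$ and the ith row of $\hat{\boldsymbol{\Pi}}'$ is given by

$$\hat{\boldsymbol{\pi}}_i' = \left[\frac{1}{T} \sum_{t=1}^{T} y_{it} \mathbf{x}_t'\right] \left[\frac{1}{T} \sum_{t=1}^{T} \mathbf{x}_t \mathbf{x}_t'\right]^{-1}.$$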

APPENDIX 4.B. Triangular Factorization
of the Covariance Matrix for an MA(1) Process


This appendix establishes that the triangular factorization of $\boldsymbol{\Omega}$ in [4.5.17] is given by [4.5.18]
and [4.5.19].

The magnitude $\sigma^2$ is simply a constant term that will end up multiplying every term
in the D matrix. Recognizing this, we can initially solve the factorization assuming that
$\sigma^2 = 1$, and then multiply the resulting D matrix by $\sigma^2$ to obtain the result for the general
case. The (1, 1) element of D (ignoring the factor $\sigma^2$) is given by the (1, 1) element of $\boldsymbol{\Omega}$:
$d_{11} = (1 + \theta^2)$. To put a zero in the (2, 1) position of $\boldsymbol{\Omega}$, we multiply the first row of $\boldsymbol{\Omega}$
by $\theta/(1 + \theta^2)$ and subtract the result from the second; hence, $a_{21} = \theta/(1 + \theta^2)$. This
operation changes the (2, 2) element of $\boldsymbol{\Omega}$ to

$$d_{22} = (1 + \theta^2) - \frac{\theta^2}{1 + \theta^2} = \frac{(1 + \theta^2)^2 - \theta^2}{1 + \theta^2} = \frac{1 + \theta^2 + \theta^4}{1 + \theta^2}.$$

To put a zero in the (3, 2) element of $\boldsymbol{\Omega}$, the second row of the new matrix must be multiplied
by $\theta/d_{22}$ and then subtracted from the third row; hence,

$$a_{32} = \theta/d_{22} = \frac{\theta(1 + \theta^2)}{1 + \theta^2 + \theta^4}.$$




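This changes the (3, 3) element to

$$\begin{aligned}
d_{33} &= (1 + \theta^2) - \frac{\theta^2(1 + \theta^2)}{1 + \theta^2 + \theta^4} \\
&= \frac{(1 + \theta^2)(1 + \theta^2 + \theta^4) - \theta^2(1 + \theta^2)}{1 + \theta^2 + \theta^4} \\
&= \frac{(1 + \theta^2 + \theta^4) + \theta^2(1 + \theta^2 + \theta^4) - \theta^2(1 + \theta^2)}{1 + \theta^2 + \theta^4} \\
&= \frac{1 + \theta^2 + \theta^4 + \theta^6}{1 + \theta^2 + \theta^4}.
\end{aligned}$$

In general, for the ith row,

$$d_{ii} = \frac{1 + \theta^2 + \theta^4 + \cdots + \theta^{2i}}{1 + \theta^2 + \theta^4 + \cdots + \theta^{2(i-1)}}.$$

To put a zero in the (i + 1, i) position, multiply by

$$a_{i+1,i} = \theta/d_{ii} = \frac{\theta[1 + \theta^2 + \theta^4 + \cdots + \theta^{2(i-1)}]}{1 + \theta^2 + \theta^4 + \cdots + \theta^{2i}}$$

and subtract from the (i + 1)th row, producing

$$\begin{aligned}
d_{i+1,i+1} &= (1 + \theta^2) - \frac{\theta^2[1 + \theta^2 + \theta^4 + \cdots + \theta^{2(i-1)}]}{1 + \theta^2 + \theta^4 + \cdots + \theta^{2i}} \\
&= \frac{(1 + \theta^2 + \cdots + \theta^{2i}) + \theta^2(1 + \theta^2 + \cdots + \theta^{2i}) - \theta^2[1 + \theta^2 + \cdots + \theta^{2(i-1)}]}{1 + \theta^2 + \theta^4 + \cdots + \theta^{2i}} \\
&= \frac{1 + \theta^2 + \theta^4 + \cdots + \theta^{2(i+1)}}{1 + \theta^2 + \theta^4 + \cdots + \theta^{2i}}.
\end{aligned}$$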

Chapter 4 Exercises 


4.1. Use formula [4.3.6] to show that for a covariance-stationary process, the projection
of $Y_{t+1}$ on a constant and $Y_t$ is given by

$$\hat{E}(Y_{t+1} \mid Y_t) = (1 - \rho_1)\mu + \rho_1 Y_t,$$

where $\mu = E(Y_t)$ and $\rho_1 = \gamma_1/\gamma_0$.
(a) Show that for the AR(1) process, this reproduces equation [4.2.19] for s = 1.
(b) Show that for the MA(1) process, this reproduces equation [4.5.20] for n = 2.
(c) Show that for an AR(2) process, the implied forecast is

$$\mu + [\phi_1/(1 - \phi_2)](Y_t - \mu).$$

Is the error associated with this forecast correlated with $Y_t$? Is it correlated with $Y_{t-1}$?

4.2. Verify equation [4.3.3].


4.3. Find the triangular factorization of the following matrix:

$$\begin{bmatrix} 1 & -2 & 3 \\ -2 & 6 & -4 \\ 3 & -4 & 12 \end{bmatrix}$$

4.4. Can the coefficient on $Y_2$ from a linear projection of $Y_4$ on $Y_3$, $Y_2$, and $Y_1$ be found
from the (4, 2) element of the matrix A from the triangular factorization of $\boldsymbol{\Omega} = E(\mathbf{Y}\mathbf{Y}')$?


4.5. Suppose that $X_t$ follows an AR(p) process and $v_t$ is a white noise process that is
uncorrelated with $X_{t-j}$ for all j. Show that the sum

$$Y_t = X_t + v_t$$

follows an ARMA(p, p) process.




4.6. Generalize Exercise 4.5 to deduce that if one adds together an AR(p) process with 
an MA(q) process and if these two processes are uncorrelated with each other at all leads 
and lags, then the result is an ARMA(p, p + q) process. 


Chapter 4 References


Ashley, Richard. 1988. "On the Relative Worth of Recent Macroeconomic Forecasts."
International Journal of Forecasting 4:363-76.

Box, George E. P., and Gwilym M. Jenkins. 1976. Time Series Analysis: Forecasting and
Control, rev. ed. San Francisco: Holden-Day.

Nelson, Charles R. 1972. "The Prediction Performance of the F.R.B.-M.I.T.-PENN Model
of the U.S. Economy." American Economic Review 62:902-17.

Sargent, Thomas J. 1987. Macroeconomic Theory, 2d ed. Boston: Academic Press.

Wold, Herman. 1938 (2d ed. 1954). A Study in the Analysis of Stationary Time Series.
Uppsala, Sweden: Almqvist and Wiksell.




5 Maximum Likelihood Estimation


5.1. Introduction 
Consider an ARMA model of the form

$$Y_t = c + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \cdots + \phi_p Y_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \cdots + \theta_q \varepsilon_{t-q}, \qquad [5.1.1]$$

with $\varepsilon_t$ white noise:

$$E(\varepsilon_t) = 0 \qquad [5.1.2]$$

$$E(\varepsilon_t \varepsilon_\tau) = \begin{cases} \sigma^2 & \text{for } t = \tau \\ 0 & \text{otherwise.} \end{cases} \qquad [5.1.3]$$

The previous chapters assumed that the population parameters $(c, \phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q, \sigma^2)$
were known and showed how population moments such as $E(Y_t Y_{t-j})$
and linear forecasts $\hat{E}(Y_{t+s} \mid Y_t, Y_{t-1}, \ldots)$ could be calculated as functions of these
population parameters. This chapter explores how to estimate the values of
$(c, \phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q, \sigma^2)$ on the basis of observations on Y.

The primary principle on which estimation will be based is maximum likelihood.
Let $\boldsymbol{\theta} = (c, \phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q, \sigma^2)'$ denote the vector of population
parameters. Suppose we have observed a sample of size T $(y_1, y_2, \ldots, y_T)$. The
approach will be to calculate the probability density

$$f_{Y_T, Y_{T-1}, \ldots, Y_1}(y_T, y_{T-1}, \ldots, y_1; \boldsymbol{\theta}), \qquad [5.1.4]$$

which might loosely be viewed as the probability of having observed this particular
sample. The maximum likelihood estimate (MLE) of $\boldsymbol{\theta}$ is the value for which this
sample is most likely to have been observed; that is, it is the value of $\boldsymbol{\theta}$ that
maximizes [5.1.4].

This approach requires specifying a particular distribution for the white noise
process $\varepsilon_t$. Typically we will assume that $\varepsilon_t$ is Gaussian white noise:

$$\varepsilon_t \sim \text{i.i.d. } N(0, \sigma^2). \qquad [5.1.5]$$

Although this assumption is strong, the estimates of $\boldsymbol{\theta}$ that result from it will often
turn out to be sensible for non-Gaussian processes as well.

Finding maximum likelihood estimates conceptually involves two steps. First,
the likelihood function [5.1.4] must be calculated. Second, values of $\boldsymbol{\theta}$ must be
found that maximize this function. This chapter is organized around these two
steps. Sections 5.2 through 5.6 show how to calculate the likelihood function for
different Gaussian ARMA specifications, while subsequent sections review general
techniques for numerical optimization.




5.2. The Likelihood Function for a Gaussian 
AR(1) Process 


Evaluating the Likelihood Function 
A Gaussian AR(1) process takes the form

$$Y_t = c + \phi Y_{t-1} + \varepsilon_t, \qquad [5.2.1]$$

with $\varepsilon_t \sim \text{i.i.d. } N(0, \sigma^2)$. For this case, the vector of population parameters to be
estimated consists of $\boldsymbol{\theta} = (c, \phi, \sigma^2)'$.

Consider the probability distribution of $Y_1$, the first observation in the sample.
From equations [3.4.3] and [3.4.4] this is a random variable with mean

$$E(Y_1) = \mu = c/(1 - \phi)$$

and variance

$$E(Y_1 - \mu)^2 = \sigma^2/(1 - \phi^2).$$

Since $\{\varepsilon_t\}_{t=-\infty}^{\infty}$ is Gaussian, $Y_1$ is also Gaussian. Hence, the density of the first
observation takes the form

$$f_{Y_1}(y_1; \boldsymbol{\theta}) = f_{Y_1}(y_1; c, \phi, \sigma^2) = \frac{1}{\sqrt{2\pi}\sqrt{\sigma^2/(1 - \phi^2)}} \exp\left[\frac{-\{y_1 - [c/(1 - \phi)]\}^2}{2\sigma^2/(1 - \phi^2)}\right]. \qquad [5.2.2]$$

Next consider the distribution of the second observation $Y_2$ conditional on observing
$Y_1 = y_1$. From [5.2.1],

$$Y_2 = c + \phi Y_1 + \varepsilon_2. \qquad [5.2.3]$$


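Conditioning on $Y_1 = y_1$ means treating the random variable $Y_1$ as if it were the
deterministic constant $y_1$. For this case, [5.2.3] gives $Y_2$ as the constant $(c + \phi y_1)$
plus the $N(0, \sigma^2)$ variable $\varepsilon_2$. Hence,

$$(Y_2 \mid Y_1 = y_1) \sim N((c + \phi y_1), \sigma^2),$$

meaning

$$f_{Y_2 \mid Y_1}(y_2 \mid y_1; \boldsymbol{\theta}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[\frac{-(y_2 - c - \phi y_1)^2}{2\sigma^2}\right]. \qquad [5.2.4]$$

The joint density of observations 1 and 2 is then just the product of [5.2.4] and [5.2.2]:

$$f_{Y_2, Y_1}(y_2, y_1; \boldsymbol{\theta}) = f_{Y_2 \mid Y_1}(y_2 \mid y_1; \boldsymbol{\theta}) \cdot f_{Y_1}(y_1; \boldsymbol{\theta}).$$

Similarly, the distribution of the third observation conditional on the first two is

$$f_{Y_3 \mid Y_2, Y_1}(y_3 \mid y_2, y_1; \boldsymbol{\theta}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[\frac{-(y_3 - c - \phi y_2)^2}{2\sigma^2}\right],$$

from which

$$f_{Y_3, Y_2, Y_1}(y_3, y_2, y_1; \boldsymbol{\theta}) = f_{Y_3 \mid Y_2, Y_1}(y_3 \mid y_2, y_1; \boldsymbol{\theta}) \cdot f_{Y_2, Y_1}(y_2, y_1; \boldsymbol{\theta}).$$

In general, the values of $Y_1, Y_2, \ldots, Y_{t-1}$ matter for $Y_t$ only through the
value of $Y_{t-1}$, and the density of observation t conditional on the preceding t - 1
observations is given by

$$\begin{aligned}
f_{Y_t \mid Y_{t-1}, Y_{t-2}, \ldots, Y_1}(y_t \mid y_{t-1}, y_{t-2}, \ldots, y_1; \boldsymbol{\theta})
&= f_{Y_t \mid Y_{t-1}}(y_t \mid y_{t-1}; \boldsymbol{\theta}) \\
&= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[\frac{-(y_t - c - \phi y_{t-1})^2}{2\sigma^2}\right].
\end{aligned} \qquad [5.2.5]$$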


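The joint density of the first t observations is then

$$f_{Y_t, Y_{t-1}, \ldots, Y_1}(y_t, y_{t-1}, \ldots, y_1; \boldsymbol{\theta}) = f_{Y_t \mid Y_{t-1}}(y_t \mid y_{t-1}; \boldsymbol{\theta}) \cdot f_{Y_{t-1}, Y_{t-2}, \ldots, Y_1}(y_{t-1}, y_{t-2}, \ldots, y_1; \boldsymbol{\theta}). \qquad [5.2.6]$$

The likelihood of the complete sample can thus be calculated as

$$f_{Y_T, Y_{T-1}, \ldots, Y_1}(y_T, y_{T-1}, \ldots, y_1; \boldsymbol{\theta}) = f_{Y_1}(y_1; \boldsymbol{\theta}) \cdot \prod_{t=2}^{T} f_{Y_t \mid Y_{t-1}}(y_t \mid y_{t-1}; \boldsymbol{\theta}). \qquad [5.2.7]$$

The log likelihood function (denoted $\mathcal{L}(\boldsymbol{\theta})$) can be found by taking logs of [5.2.7]:

$$\mathcal{L}(\boldsymbol{\theta}) = \log f_{Y_1}(y_1; \boldsymbol{\theta}) + \sum_{t=2}^{T} \log f_{Y_t \mid Y_{t-1}}(y_t \mid y_{t-1}; \boldsymbol{\theta}). \qquad [5.2.8]$$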


Clearly, the value of $\boldsymbol{\theta}$ that maximizes [5.2.8] is identical to the value that
maximizes [5.2.7]. However, Section 5.8 presents a number of useful results that
can be calculated as a by-product of the maximization if one always poses the
problem as maximization of the log likelihood function [5.2.8] rather than the
likelihood function [5.2.7].

Substituting [5.2.2] and [5.2.5] into [5.2.8], the log likelihood for a sample
of size T from a Gaussian AR(1) process is seen to be

$$\begin{aligned}
\mathcal{L}(\boldsymbol{\theta}) ={}& -\tfrac{1}{2}\log(2\pi) - \tfrac{1}{2}\log[\sigma^2/(1 - \phi^2)] - \frac{\{y_1 - [c/(1 - \phi)]\}^2}{2\sigma^2/(1 - \phi^2)} \\
& - [(T - 1)/2]\log(2\pi) - [(T - 1)/2]\log(\sigma^2) - \sum_{t=2}^{T} \left[\frac{(y_t - c - \phi y_{t-1})^2}{2\sigma^2}\right].
\end{aligned} \qquad [5.2.9]$$
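
Expression [5.2.9] translates directly into code. The sketch below is a minimal illustration, assuming NumPy; the function name `ar1_exact_loglik` and the sample values are hypothetical, and the function simply refuses to evaluate when |φ| ≥ 1 because [5.2.2] presumes a stationary process.

```python
import numpy as np

def ar1_exact_loglik(y, c, phi, sigma2):
    """Exact Gaussian AR(1) log likelihood, following [5.2.9] term by term."""
    if abs(phi) >= 1:
        raise ValueError("[5.2.2] requires |phi| < 1")
    y = np.asarray(y, dtype=float)
    T = y.size
    mu = c / (1.0 - phi)                  # unconditional mean of Y_1
    var1 = sigma2 / (1.0 - phi ** 2)      # unconditional variance of Y_1
    # Contribution of the first observation, from [5.2.2].
    ll_first = (-0.5 * np.log(2 * np.pi) - 0.5 * np.log(var1)
                - (y[0] - mu) ** 2 / (2.0 * var1))
    # Contributions of observations 2, ..., T, from [5.2.5].
    resid = y[1:] - c - phi * y[:-1]
    ll_rest = (-0.5 * (T - 1) * np.log(2 * np.pi)
               - 0.5 * (T - 1) * np.log(sigma2)
               - np.sum(resid ** 2) / (2.0 * sigma2))
    return ll_first + ll_rest

# Illustrative (assumed) data and parameter values.
print(ar1_exact_loglik([1.0, 2.0, 0.5, 1.5], c=0.5, phi=0.6, sigma2=2.0))
```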


An Alternative Expression for the Likelihood Function 


A different description of the likelihood function for a sample of size T from
a Gaussian AR(1) process is sometimes useful. Collect the full set of observations
in a (T × 1) vector,

$$\mathbf{y} \equiv (y_1, y_2, \ldots, y_T)'.$$

This vector could be viewed as a single realization from a T-dimensional Gaussian
distribution. The mean of this (T × 1) vector is

$$\begin{bmatrix} E(Y_1) \\ E(Y_2) \\ \vdots \\ E(Y_T) \end{bmatrix} = \begin{bmatrix} \mu \\ \mu \\ \vdots \\ \mu \end{bmatrix}, \qquad [5.2.10]$$

where, as before, $\mu = c/(1 - \phi)$. In vector form, [5.2.10] could be written

$$E(\mathbf{Y}) = \boldsymbol{\mu},$$

where $\boldsymbol{\mu}$ denotes the (T × 1) vector on the right side of [5.2.10]. The variance-covariance
matrix of Y is given by

$$E[(\mathbf{Y} - \boldsymbol{\mu})(\mathbf{Y} - \boldsymbol{\mu})'] = \boldsymbol{\Omega}, \qquad [5.2.11]$$




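where

$$\boldsymbol{\Omega} = \begin{bmatrix}
E(Y_1 - \mu)^2 & E(Y_1 - \mu)(Y_2 - \mu) & \cdots & E(Y_1 - \mu)(Y_T - \mu) \\
E(Y_2 - \mu)(Y_1 - \mu) & E(Y_2 - \mu)^2 & \cdots & E(Y_2 - \mu)(Y_T - \mu) \\
\vdots & \vdots & & \vdots \\
E(Y_T - \mu)(Y_1 - \mu) & E(Y_T - \mu)(Y_2 - \mu) & \cdots & E(Y_T - \mu)^2
\end{bmatrix}. \qquad [5.2.12]$$

The elements of this matrix correspond to autocovariances of Y. Recall that the
jth autocovariance for an AR(1) process is given by

$$E(Y_t - \mu)(Y_{t-j} - \mu) = \sigma^2 \phi^j/(1 - \phi^2). \qquad [5.2.13]$$

Hence, [5.2.12] can be written as

$$\boldsymbol{\Omega} = \sigma^2 \mathbf{V}, \qquad [5.2.14]$$

where

$$\mathbf{V} = \frac{1}{1 - \phi^2} \begin{bmatrix}
1 & \phi & \phi^2 & \cdots & \phi^{T-1} \\
\phi & 1 & \phi & \cdots & \phi^{T-2} \\
\phi^2 & \phi & 1 & \cdots & \phi^{T-3} \\
\vdots & \vdots & \vdots & & \vdots \\
\phi^{T-1} & \phi^{T-2} & \phi^{T-3} & \cdots & 1
\end{bmatrix}. \qquad [5.2.15]$$

Viewing the observed sample y as a single draw from a $N(\boldsymbol{\mu}, \boldsymbol{\Omega})$ distribution,
the sample likelihood could be written down immediately from the formula for the
multivariate Gaussian density:

$$f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta}) = (2\pi)^{-T/2} |\boldsymbol{\Omega}^{-1}|^{1/2} \exp[-\tfrac{1}{2}(\mathbf{y} - \boldsymbol{\mu})'\boldsymbol{\Omega}^{-1}(\mathbf{y} - \boldsymbol{\mu})], \qquad [5.2.16]$$

with log likelihood

$$\mathcal{L}(\boldsymbol{\theta}) = (-T/2)\log(2\pi) + \tfrac{1}{2}\log|\boldsymbol{\Omega}^{-1}| - \tfrac{1}{2}(\mathbf{y} - \boldsymbol{\mu})'\boldsymbol{\Omega}^{-1}(\mathbf{y} - \boldsymbol{\mu}). \qquad [5.2.17]$$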


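Evidently, [5.2.17] and [5.2.9] must represent the identical function of $(y_1, y_2, \ldots, y_T)$.
To verify that this is indeed the case, define

$$\mathbf{L} \equiv \begin{bmatrix}
\sqrt{1 - \phi^2} & 0 & 0 & \cdots & 0 & 0 \\
-\phi & 1 & 0 & \cdots & 0 & 0 \\
0 & -\phi & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
0 & 0 & 0 & \cdots & -\phi & 1
\end{bmatrix}. \qquad [5.2.18]$$

It is straightforward to show that¹

$$\mathbf{L}'\mathbf{L} = \mathbf{V}^{-1}, \qquad [5.2.19]$$


¹By direct multiplication, one calculates

$$\mathbf{L}\mathbf{V} = \frac{1}{1 - \phi^2} \begin{bmatrix}
\sqrt{1 - \phi^2} & \phi\sqrt{1 - \phi^2} & \phi^2\sqrt{1 - \phi^2} & \cdots & \phi^{T-1}\sqrt{1 - \phi^2} \\
0 & (1 - \phi^2) & \phi(1 - \phi^2) & \cdots & \phi^{T-2}(1 - \phi^2) \\
0 & 0 & (1 - \phi^2) & \cdots & \phi^{T-3}(1 - \phi^2) \\
\vdots & \vdots & \vdots & & \vdots \\
0 & 0 & 0 & \cdots & (1 - \phi^2)
\end{bmatrix},$$

and premultiplying this by $\mathbf{L}'$ produces the (T × T) identity matrix. Thus, $\mathbf{L}'\mathbf{L}\mathbf{V} = \mathbf{I}_T$, confirming
[5.2.19].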




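implying from [5.2.14] that

$$\boldsymbol{\Omega}^{-1} = \sigma^{-2}\mathbf{L}'\mathbf{L}. \qquad [5.2.20]$$

Substituting [5.2.20] into [5.2.17] results in

$$\mathcal{L}(\boldsymbol{\theta}) = (-T/2)\log(2\pi) + \tfrac{1}{2}\log|\sigma^{-2}\mathbf{L}'\mathbf{L}| - \tfrac{1}{2}(\mathbf{y} - \boldsymbol{\mu})'\sigma^{-2}\mathbf{L}'\mathbf{L}(\mathbf{y} - \boldsymbol{\mu}). \qquad [5.2.21]$$

Define the (T × 1) vector $\tilde{\mathbf{y}}$ to be

$$\tilde{\mathbf{y}} \equiv \mathbf{L}(\mathbf{y} - \boldsymbol{\mu})
= \begin{bmatrix}
\sqrt{1 - \phi^2} & 0 & 0 & \cdots & 0 & 0 \\
-\phi & 1 & 0 & \cdots & 0 & 0 \\
0 & -\phi & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
0 & 0 & 0 & \cdots & -\phi & 1
\end{bmatrix}
\begin{bmatrix} y_1 - \mu \\ y_2 - \mu \\ y_3 - \mu \\ \vdots \\ y_T - \mu \end{bmatrix}
= \begin{bmatrix}
\sqrt{1 - \phi^2}\,(y_1 - \mu) \\
(y_2 - \mu) - \phi(y_1 - \mu) \\
(y_3 - \mu) - \phi(y_2 - \mu) \\
\vdots \\
(y_T - \mu) - \phi(y_{T-1} - \mu)
\end{bmatrix}. \qquad [5.2.22]$$

Substituting $\mu = c/(1 - \phi)$, this becomes

$$\tilde{\mathbf{y}} = \begin{bmatrix}
\sqrt{1 - \phi^2}\,[y_1 - c/(1 - \phi)] \\
y_2 - c - \phi y_1 \\
y_3 - c - \phi y_2 \\
\vdots \\
y_T - c - \phi y_{T-1}
\end{bmatrix}.$$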

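The last term in [5.2.21] can thus be written

$$\begin{aligned}
\tfrac{1}{2}(\mathbf{y} - \boldsymbol{\mu})'\sigma^{-2}\mathbf{L}'\mathbf{L}(\mathbf{y} - \boldsymbol{\mu}) &= [1/(2\sigma^2)]\,\tilde{\mathbf{y}}'\tilde{\mathbf{y}} \\
&= [1/(2\sigma^2)](1 - \phi^2)[y_1 - c/(1 - \phi)]^2 + [1/(2\sigma^2)] \sum_{t=2}^{T} (y_t - c - \phi y_{t-1})^2.
\end{aligned} \qquad [5.2.23]$$

The middle term in [5.2.21] is similarly

$$\begin{aligned}
\tfrac{1}{2}\log|\sigma^{-2}\mathbf{L}'\mathbf{L}| &= \tfrac{1}{2}\log\{\sigma^{-2T} \cdot |\mathbf{L}'\mathbf{L}|\} \\
&= -\tfrac{1}{2}\log\sigma^{2T} + \tfrac{1}{2}\log|\mathbf{L}'\mathbf{L}| \\
&= (-T/2)\log\sigma^2 + \log|\mathbf{L}|,
\end{aligned} \qquad [5.2.24]$$

where use has been made of equations [A.4.8], [A.4.9], and [A.4.11] in the Mathematical
Review (Appendix A) at the end of the book. Moreover, since L is lower
triangular, its determinant is given by the product of the terms along the principal
diagonal: $|\mathbf{L}| = \sqrt{1 - \phi^2}$. Thus, [5.2.24] states that

$$\tfrac{1}{2}\log|\sigma^{-2}\mathbf{L}'\mathbf{L}| = (-T/2)\log\sigma^2 + \tfrac{1}{2}\log(1 - \phi^2). \qquad [5.2.25]$$

Substituting [5.2.23] and [5.2.25] into [5.2.21] reproduces [5.2.9]. Thus, equations
[5.2.17] and [5.2.9] are just two different expressions for the same magnitude, as
claimed. Either expression accurately describes the log likelihood function.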




Expression [5.2.17] requires inverting a (T × T) matrix, whereas [5.2.9] does
not. Thus, expression [5.2.9] is clearly to be preferred for computations. It avoids
inverting a (T × T) matrix by writing $Y_t$ as the sum of a forecast $(c + \phi Y_{t-1})$ and
a forecast error $(\varepsilon_t)$. The forecast error is independent from previous observations
by construction, so the log of its density is simply added to the log likelihood of
the preceding observations. This approach is known as a prediction-error decomposition
of the likelihood function.
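
The equivalence of the matrix form [5.2.17] and the prediction-error form can be confirmed numerically for a small sample. The sketch below is illustrative only: the data and parameter values are made up, NumPy is assumed, and the determinant and quadratic form in [5.2.17] are obtained with `numpy.linalg.slogdet` and `numpy.linalg.solve` rather than an explicit inverse.

```python
import numpy as np

# Illustrative (assumed) parameter values and data.
c, phi, sigma2 = 0.5, 0.6, 2.0
y = np.array([1.0, 2.0, 0.5, 1.5, 1.2])
T = y.size
mu = c / (1.0 - phi)
dev = y - mu

# Matrix form [5.2.17], with Omega = sigma^2 V from [5.2.14] and [5.2.15].
idx = np.arange(T)
V = phi ** np.abs(idx[:, None] - idx[None, :]) / (1.0 - phi ** 2)
Omega = sigma2 * V
_, logdet_Omega = np.linalg.slogdet(Omega)
ll_matrix = (-0.5 * T * np.log(2 * np.pi) - 0.5 * logdet_Omega
             - 0.5 * dev @ np.linalg.solve(Omega, dev))

# Prediction-error form: y_tilde = L (y - mu) with L as in [5.2.18].
L = np.eye(T) - phi * np.eye(T, k=-1)
L[0, 0] = np.sqrt(1.0 - phi ** 2)
y_tilde = L @ dev
ll_pred_error = (-0.5 * T * np.log(2 * np.pi) - 0.5 * T * np.log(sigma2)
                 + 0.5 * np.log(1.0 - phi ** 2)
                 - y_tilde @ y_tilde / (2.0 * sigma2))

print(np.isclose(ll_matrix, ll_pred_error))
print(np.allclose(L.T @ L, np.linalg.inv(V)))   # checks [5.2.19]
```

The second form needs only sums over the sample, which is the computational advantage the text attributes to [5.2.9].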


Exact Maximum Likelihood Estimates for the Gaussian 
AR(1) Process 


The MLE $\hat{\boldsymbol{\theta}}$ is the value for which [5.2.9] is maximized. In principle, this
requires differentiating [5.2.9] and setting the result equal to zero. In practice,
when an attempt is made to carry this out, the result is a system of nonlinear
equations in $\boldsymbol{\theta}$ and $(y_1, y_2, \ldots, y_T)$ for which there is no simple solution for $\boldsymbol{\theta}$ in
terms of $(y_1, y_2, \ldots, y_T)$. Maximization of [5.2.9] thus requires iterative or numerical
procedures described in Section 5.7.


Conditional Maximum Likelihood Estimates 


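An alternative to numerical maximization of the exact likelihood function is
to regard the value of $y_1$ as deterministic and maximize the likelihood conditioned
on the first observation,

$$f_{Y_T, Y_{T-1}, \ldots, Y_2 \mid Y_1}(y_T, y_{T-1}, \ldots, y_2 \mid y_1; \boldsymbol{\theta}) = \prod_{t=2}^{T} f_{Y_t \mid Y_{t-1}}(y_t \mid y_{t-1}; \boldsymbol{\theta}), \qquad [5.2.26]$$

the objective then being to maximize

$$\begin{aligned}
\log f_{Y_T, Y_{T-1}, \ldots, Y_2 \mid Y_1}(y_T, y_{T-1}, \ldots, y_2 \mid y_1; \boldsymbol{\theta})
={}& -[(T - 1)/2]\log(2\pi) - [(T - 1)/2]\log(\sigma^2) \\
& - \sum_{t=2}^{T} \left[\frac{(y_t - c - \phi y_{t-1})^2}{2\sigma^2}\right].
\end{aligned} \qquad [5.2.27]$$

Maximization of [5.2.27] with respect to c and $\phi$ is equivalent to minimization
of

$$\sum_{t=2}^{T} (y_t - c - \phi y_{t-1})^2, \qquad [5.2.28]$$

which is achieved by an ordinary least squares (OLS) regression of $y_t$ on a constant
and its own lagged value. The conditional maximum likelihood estimates of c and
$\phi$ are therefore given by

$$\begin{bmatrix} \hat{c} \\ \hat{\phi} \end{bmatrix} = \begin{bmatrix} T - 1 & \Sigma y_{t-1} \\ \Sigma y_{t-1} & \Sigma y_{t-1}^2 \end{bmatrix}^{-1} \begin{bmatrix} \Sigma y_t \\ \Sigma y_{t-1} y_t \end{bmatrix},$$

where $\Sigma$ denotes summation over t = 2, 3, ..., T.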

The conditional maximum likelihood estimate of the innovation variance is
found by differentiating [5.2.27] with respect to $\sigma^2$ and setting the result equal to
zero:

$$\frac{-(T - 1)}{2\sigma^2} + \sum_{t=2}^{T} \left[\frac{(y_t - c - \phi y_{t-1})^2}{2\sigma^4}\right] = 0,$$




or

$$\hat{\sigma}^2 = \sum_{t=2}^{T} \left[\frac{(y_t - \hat{c} - \hat{\phi} y_{t-1})^2}{T - 1}\right].$$

In other words, the conditional MLE is the average squared residual from the OLS
regression [5.2.28].

In contrast to exact maximum likelihood estimates, the conditional maximum
likelihood estimates are thus trivial to compute. Moreover, if the sample size T is
sufficiently large, the first observation makes a negligible contribution to the total
likelihood. The exact MLE and conditional MLE turn out to have the same large-sample
distribution, provided that $|\phi| < 1$. And when $|\phi| > 1$, the conditional MLE
continues to provide consistent estimates, whereas maximization of [5.2.9] does
not. This is because [5.2.9] is derived from [5.2.2], which does not accurately
describe the density of $Y_1$ when $|\phi| > 1$. For these reasons, in most applications
the parameters of an autoregression are estimated by OLS (conditional maximum
likelihood) rather than exact maximum likelihood.
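
The conditional estimates described above are the solution of a 2 × 2 linear system followed by an average of squared residuals. A minimal Python sketch follows, assuming NumPy; the function name `ar1_conditional_mle` is hypothetical. The usage example feeds it a noiseless series that doubles each period, so the estimates are recovered exactly even though φ = 2 lies outside the stationary region.

```python
import numpy as np

def ar1_conditional_mle(y):
    """Conditional MLE (c_hat, phi_hat, sigma2_hat) for an AR(1), via OLS."""
    y = np.asarray(y, dtype=float)
    y_lag, y_cur = y[:-1], y[1:]          # y_{t-1} and y_t for t = 2, ..., T
    n = y_cur.size                        # T - 1 usable observations
    XtX = np.array([[n, y_lag.sum()],
                    [y_lag.sum(), (y_lag ** 2).sum()]])
    Xty = np.array([y_cur.sum(), (y_lag * y_cur).sum()])
    c_hat, phi_hat = np.linalg.solve(XtX, Xty)
    resid = y_cur - c_hat - phi_hat * y_lag
    sigma2_hat = (resid ** 2).sum() / n   # average squared OLS residual
    return c_hat, phi_hat, sigma2_hat

# Noiseless illustration (assumed data): y_t = 2 * y_{t-1}.
c_hat, phi_hat, sigma2_hat = ar1_conditional_mle([1.0, 2.0, 4.0, 8.0])
print(c_hat, phi_hat, sigma2_hat)
```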


5.3. The Likelihood Function for a Gaussian 
AR(p) Process 


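This section discusses a Gaussian AR(p) process,

$$Y_t = c + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \cdots + \phi_p Y_{t-p} + \varepsilon_t, \qquad [5.3.1]$$

with $\varepsilon_t \sim \text{i.i.d. } N(0, \sigma^2)$. In this case, the vector of population parameters to be
estimated is $\boldsymbol{\theta} = (c, \phi_1, \phi_2, \ldots, \phi_p, \sigma^2)'$.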


Evaluating the Likelihood Function 


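A combination of the two methods described for the AR(1) case is used to
calculate the likelihood function for a sample of size T for an AR(p) process. The
first p observations in the sample $(y_1, y_2, \ldots, y_p)$ are collected in a (p × 1) vector
$\mathbf{y}_p$, which is viewed as the realization of a p-dimensional Gaussian variable. The
mean of this vector is $\boldsymbol{\mu}_p$, which denotes a (p × 1) vector each of whose elements
is given by

$$\mu = c/(1 - \phi_1 - \phi_2 - \cdots - \phi_p). \qquad [5.3.2]$$

Let $\sigma^2\mathbf{V}_p$ denote the (p × p) variance-covariance matrix of $(Y_1, Y_2, \ldots, Y_p)$:

$$\sigma^2\mathbf{V}_p = \begin{bmatrix}
E(Y_1 - \mu)^2 & E(Y_1 - \mu)(Y_2 - \mu) & \cdots & E(Y_1 - \mu)(Y_p - \mu) \\
E(Y_2 - \mu)(Y_1 - \mu) & E(Y_2 - \mu)^2 & \cdots & E(Y_2 - \mu)(Y_p - \mu) \\
\vdots & \vdots & & \vdots \\
E(Y_p - \mu)(Y_1 - \mu) & E(Y_p - \mu)(Y_2 - \mu) & \cdots & E(Y_p - \mu)^2
\end{bmatrix}. \qquad [5.3.3]$$

For example, for a first-order autoregression (p = 1), $\mathbf{V}_p$ is the scalar $1/(1 - \phi^2)$.
For a general pth-order autoregression,

$$\sigma^2\mathbf{V}_p = \begin{bmatrix}
\gamma_0 & \gamma_1 & \gamma_2 & \cdots & \gamma_{p-1} \\
\gamma_1 & \gamma_0 & \gamma_1 & \cdots & \gamma_{p-2} \\
\gamma_2 & \gamma_1 & \gamma_0 & \cdots & \gamma_{p-3} \\
\vdots & \vdots & \vdots & & \vdots \\
\gamma_{p-1} & \gamma_{p-2} & \gamma_{p-3} & \cdots & \gamma_0
\end{bmatrix},$$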




where ¥,, the jth autocovariance for an AR(p) process, can be calculated using the 
methods in Chapter 3. The density of the first p observations is then that of a 
N(w,, 77V,) variable: 


Fe. 1% Vp» Yp-1 sens ¥13 6) 


1 ae 
= (2n)-|o-2V 51/12 exp] ~ hi, — Bp)'Ve (Vp — H| [5.3.4] 


1 ge 
= (22) -? (0-7) P| V5 U2 ex| ~ hi = By) V> lly, ~ a) , 


where use has been made of result [A.4.8]. 

For the remaining observations in the sample, (y,,1, Yp+2, +--+» Yr), the 
prediction-error decomposition can be used. Conditional on the first t — 1 obser- 
vations, the tth observation is Gaussian with mean 


c+ Piyr=1 + b2Yr-2 aie PpYr—p 


and variance o?. Only the p most recent observations matter for this distribution. 
Hence, for t > p, 


Fey, n% nv dyna Vena. eo Vas @) 


= Fett ¥inae¥i pl IAVe= 00 Ye-29 +++ > Yemps 6) 
a 1 —(% —C— hyena 7 Gryrnz 0m bp¥t-p)” 
~ \Vamot XP 2c? ‘ 


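The likelihood function for the complete sample is then

$$\begin{aligned}
f_{Y_T, Y_{T-1}, \ldots, Y_1}(y_T, y_{T-1}, \ldots, y_1; \boldsymbol{\theta})
={}& f_{Y_p, Y_{p-1}, \ldots, Y_1}(y_p, y_{p-1}, \ldots, y_1; \boldsymbol{\theta}) \\
& \times \prod_{t=p+1}^{T} f_{Y_t \mid Y_{t-1}, Y_{t-2}, \ldots, Y_{t-p}}(y_t \mid y_{t-1}, y_{t-2}, \ldots, y_{t-p}; \boldsymbol{\theta}),
\end{aligned} \qquad [5.3.5]$$

and the log likelihood is therefore

$$\begin{aligned}
\mathcal{L}(\boldsymbol{\theta}) ={}& \log f_{Y_T, Y_{T-1}, \ldots, Y_1}(y_T, y_{T-1}, \ldots, y_1; \boldsymbol{\theta}) \\
={}& -\frac{p}{2}\log(2\pi) - \frac{p}{2}\log(\sigma^2) + \frac{1}{2}\log|\mathbf{V}_p^{-1}| - \frac{1}{2\sigma^2}(\mathbf{y}_p - \boldsymbol{\mu}_p)'\mathbf{V}_p^{-1}(\mathbf{y}_p - \boldsymbol{\mu}_p) \\
& - \frac{T - p}{2}\log(2\pi) - \frac{T - p}{2}\log(\sigma^2) - \sum_{t=p+1}^{T} \frac{(y_t - c - \phi_1 y_{t-1} - \phi_2 y_{t-2} - \cdots - \phi_p y_{t-p})^2}{2\sigma^2} \\
={}& -\frac{T}{2}\log(2\pi) - \frac{T}{2}\log(\sigma^2) + \frac{1}{2}\log|\mathbf{V}_p^{-1}| - \frac{1}{2\sigma^2}(\mathbf{y}_p - \boldsymbol{\mu}_p)'\mathbf{V}_p^{-1}(\mathbf{y}_p - \boldsymbol{\mu}_p) \\
& - \sum_{t=p+1}^{T} \frac{(y_t - c - \phi_1 y_{t-1} - \phi_2 y_{t-2} - \cdots - \phi_p y_{t-p})^2}{2\sigma^2}.
\end{aligned} \qquad [5.3.6]$$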


Evaluation of [5.3.6] requires inverting the (p × p) matrix $\mathbf{V}_p$. Denote the
row i, column j element of $\mathbf{V}_p^{-1}$ by $v^{ij}(p)$. Galbraith and Galbraith (1974, equation
16, p. 70) showed that

$$v^{ij}(p) = \left[\sum_{k=0}^{i-1} \phi_k \phi_{k+j-i} - \sum_{k=p+1-j}^{p+i-j} \phi_k \phi_{k+j-i}\right] \quad \text{for } 1 \le i \le j \le p, \qquad [5.3.7]$$

where $\phi_0 \equiv -1$. Values of $v^{ij}(p)$ for i > j can be inferred from the fact that $\mathbf{V}_p^{-1}$
is symmetric ($v^{ij}(p) = v^{ji}(p)$). For example, for an AR(1) process, $\mathbf{V}_p^{-1}$ is a scalar
whose value is found by taking i = j = p = 1:

$$\mathbf{V}_1^{-1} = \left[\sum_{k=0}^{0} \phi_k \phi_k - \sum_{k=1}^{1} \phi_k \phi_k\right] = (\phi_0^2 - \phi_1^2) = (1 - \phi^2).$$

Thus $\sigma^2\mathbf{V}_1 = \sigma^2/(1 - \phi^2)$, which indeed reproduces the formula for the variance
of an AR(1) process. For p = 2, equation [5.3.7] implies

$$\mathbf{V}_2^{-1} = \begin{bmatrix} (1 - \phi_2^2) & -(\phi_1 + \phi_1\phi_2) \\ -(\phi_1 + \phi_1\phi_2) & (1 - \phi_2^2) \end{bmatrix},$$

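from which one readily calculates

$$|\mathbf{V}_2^{-1}| = \left|(1 + \phi_2)\begin{bmatrix} (1 - \phi_2) & -\phi_1 \\ -\phi_1 & (1 - \phi_2) \end{bmatrix}\right| = (1 + \phi_2)^2[(1 - \phi_2)^2 - \phi_1^2]$$

and

$$\begin{aligned}
(\mathbf{y}_2 - \boldsymbol{\mu}_2)'\mathbf{V}_2^{-1}(\mathbf{y}_2 - \boldsymbol{\mu}_2)
&= (1 + \phi_2)\begin{bmatrix} (y_1 - \mu) & (y_2 - \mu) \end{bmatrix}
\begin{bmatrix} (1 - \phi_2) & -\phi_1 \\ -\phi_1 & (1 - \phi_2) \end{bmatrix}
\begin{bmatrix} (y_1 - \mu) \\ (y_2 - \mu) \end{bmatrix} \\
&= (1 + \phi_2) \times \{(1 - \phi_2)(y_1 - \mu)^2 - 2\phi_1(y_1 - \mu)(y_2 - \mu) + (1 - \phi_2)(y_2 - \mu)^2\}.
\end{aligned}$$

The exact log likelihood for a Gaussian AR(2) process is thus given by

$$\begin{aligned}
\mathcal{L}(\boldsymbol{\theta}) ={}& -\frac{T}{2}\log(2\pi) - \frac{T}{2}\log(\sigma^2) + \frac{1}{2}\log\{(1 + \phi_2)^2[(1 - \phi_2)^2 - \phi_1^2]\} \\
& - \left\{\frac{1 + \phi_2}{2\sigma^2}\right\} \times \{(1 - \phi_2)(y_1 - \mu)^2 - 2\phi_1(y_1 - \mu)(y_2 - \mu) + (1 - \phi_2)(y_2 - \mu)^2\} \\
& - \sum_{t=3}^{T} \frac{(y_t - c - \phi_1 y_{t-1} - \phi_2 y_{t-2})^2}{2\sigma^2},
\end{aligned} \qquad [5.3.8]$$

where $\mu = c/(1 - \phi_1 - \phi_2)$.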
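
The closed form [5.3.7] for the elements of $\mathbf{V}_p^{-1}$ can be compared with a direct numerical inverse. In the sketch below, which assumes NumPy, `vp_inverse_formula` implements [5.3.7], while `vp_from_companion` is a swapped-in independent check that is not part of this section: it obtains $\mathbf{V}_p$ (with $\sigma^2 = 1$) by solving $\boldsymbol{\Sigma} = \mathbf{F}\boldsymbol{\Sigma}\mathbf{F}' + \mathbf{Q}$ for the companion-form state vector. Both function names and the coefficient values are hypothetical.

```python
import numpy as np

def vp_inverse_formula(phi):
    """V_p^{-1} from [5.3.7]; phi = [phi_1, ..., phi_p]."""
    p = len(phi)
    f = np.concatenate(([-1.0], np.asarray(phi, dtype=float)))  # f[k] = phi_k, phi_0 = -1
    Vinv = np.empty((p, p))
    for i in range(1, p + 1):
        for j in range(i, p + 1):
            s1 = sum(f[k] * f[k + j - i] for k in range(0, i))
            s2 = sum(f[k] * f[k + j - i] for k in range(p + 1 - j, p + i - j + 1))
            Vinv[i - 1, j - 1] = Vinv[j - 1, i - 1] = s1 - s2    # symmetric
    return Vinv

def vp_from_companion(phi):
    """V_p for sigma^2 = 1 via vec(S) = (I - F kron F)^{-1} vec(Q) (independent check)."""
    p = len(phi)
    F = np.zeros((p, p))
    F[0, :] = phi
    F[1:, :-1] = np.eye(p - 1)
    Q = np.zeros((p, p))
    Q[0, 0] = 1.0
    vec_S = np.linalg.solve(np.eye(p * p) - np.kron(F, F), Q.reshape(-1))
    return vec_S.reshape(p, p)

phi = [0.5, 0.3]   # illustrative stationary AR(2) coefficients (assumed)
print(vp_inverse_formula(phi))
print(np.linalg.inv(vp_from_companion(phi)))
```

For p = 2 the first matrix printed should agree with the explicit expression for $\mathbf{V}_2^{-1}$ given above.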


Conditional Maximum Likelihood Estimates 


Maximization of the exact log likelihood for an AR(p) process [5.3.6] must
be accomplished numerically. In contrast, the log of the likelihood conditional on
the first p observations assumes the simple form

$$\begin{aligned}
\log f_{Y_T, Y_{T-1}, \ldots, Y_{p+1} \mid Y_p, \ldots, Y_1}(y_T, y_{T-1}, \ldots, y_{p+1} \mid y_p, \ldots, y_1; \boldsymbol{\theta})
={}& -\frac{T - p}{2}\log(2\pi) - \frac{T - p}{2}\log(\sigma^2) \\
& - \sum_{t=p+1}^{T} \frac{(y_t - c - \phi_1 y_{t-1} - \phi_2 y_{t-2} - \cdots - \phi_p y_{t-p})^2}{2\sigma^2}.
\end{aligned} \qquad [5.3.9]$$

The values of $c, \phi_1, \phi_2, \ldots, \phi_p$ that maximize [5.3.9] are the same as those that
minimize

$$\sum_{t=p+1}^{T} (y_t - c - \phi_1 y_{t-1} - \phi_2 y_{t-2} - \cdots - \phi_p y_{t-p})^2. \qquad [5.3.10]$$

Thus, the conditional maximum likelihood estimates of these parameters can be
obtained from an OLS regression of $y_t$ on a constant and p of its own lagged values.
The conditional maximum likelihood estimate of $\sigma^2$ turns out to be the average
squared residual from this regression:

$$\hat{\sigma}^2 = \frac{1}{T - p} \sum_{t=p+1}^{T} (y_t - \hat{c} - \hat{\phi}_1 y_{t-1} - \hat{\phi}_2 y_{t-2} - \cdots - \hat{\phi}_p y_{t-p})^2.$$

The exact maximum likelihood estimates and the conditional maximum likelihood
estimates again have the same large-sample distribution.
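
As with the AR(1) case, the conditional estimates for an AR(p) reduce to one least squares fit. The following sketch assumes NumPy and uses `numpy.linalg.lstsq` to minimize [5.3.10]; the function name `arp_conditional_mle`, the seed, and the simulated AR(2) parameters are all hypothetical choices for illustration.

```python
import numpy as np

def arp_conditional_mle(y, p):
    """Conditional MLE (c_hat, phi_hat, sigma2_hat) for an AR(p), via OLS."""
    y = np.asarray(y, dtype=float)
    T = y.size
    # Regressors [1, y_{t-1}, ..., y_{t-p}] for the targets y_{p+1}, ..., y_T.
    X = np.column_stack([np.ones(T - p)] +
                        [y[p - k:T - k] for k in range(1, p + 1)])
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    resid = y[p:] - X @ coef
    sigma2_hat = resid @ resid / (T - p)   # average squared residual
    return coef[0], coef[1:], sigma2_hat

# Simulated AR(2) data (assumed parameters, for illustration only).
rng = np.random.default_rng(1)
T = 500
y = np.zeros(T)
for t in range(2, T):
    y[t] = 1.0 + 0.5 * y[t - 1] - 0.3 * y[t - 2] + rng.standard_normal()

c_hat, phi_hat, sigma2_hat = arp_conditional_mle(y, p=2)
print(c_hat, phi_hat, sigma2_hat)
```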


Maximum Likelihood Estimation for Non-Gaussian Time 
Series 


We noted in Chapter 4 that an OLS regression of a variable on a constant
and p of its lags would yield a consistent estimate of the coefficients of the linear
projection,

$$\hat{E}(Y_t \mid Y_{t-1}, Y_{t-2}, \ldots, Y_{t-p}),$$

provided that the process is ergodic for second moments. This OLS regression also
maximizes the Gaussian conditional log likelihood [5.3.9]. Thus, even if the process
is non-Gaussian, if we mistakenly form a Gaussian log likelihood function and
maximize it, the resulting estimates $(\hat{c}, \hat{\phi}_1, \hat{\phi}_2, \ldots, \hat{\phi}_p)$ will provide consistent
estimates of the population parameters in [5.3.1].

An estimate that maximizes a misspecified likelihood function (for example, 
an MLE calculated under the assumption of a Gaussian process when the true data 
are non-Gaussian) is known as a quasi-maximum likelihood estimate. Sometimes, 
as turns out to be the case here, quasi-maximum likelihood estimation provides 
consistent estimates of the population parameters of interest. However, standard 
errors for the estimated coefficients that are calculated under the Gaussianity 
assumption need not be correct if the true data are non-Gaussian.²

Alternatively, if the raw data are non-Gaussian, sometimes a simple transformation
such as taking logs will produce a Gaussian time series. For a positive
random variable $Y_t$, Box and Cox (1964) proposed the general class of transformations

$$Y_t^{(\lambda)} = \begin{cases} \dfrac{Y_t^{\lambda} - 1}{\lambda} & \text{for } \lambda \ne 0 \\[1ex] \log Y_t & \text{for } \lambda = 0. \end{cases}$$

One approach is to pick a particular value of $\lambda$ and maximize the likelihood function
for $Y_t^{(\lambda)}$ under the assumption that $Y_t^{(\lambda)}$ is a Gaussian ARMA process. The value
of $\lambda$ that is associated with the highest value of the maximized likelihood is taken
as the best transformation. However, Nelson and Granger (1979) reported discouraging
results from this method in practice.
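
For concreteness, the Box-Cox family can be written as a short function. This is a minimal sketch assuming NumPy; the name `box_cox` is hypothetical, and the function covers only the transformation itself, not the likelihood-based search over λ described above.

```python
import numpy as np

def box_cox(y, lam):
    """Box-Cox transform of a strictly positive series for a given lambda."""
    y = np.asarray(y, dtype=float)
    if np.any(y <= 0):
        raise ValueError("the Box-Cox transformation requires positive data")
    if lam == 0:
        return np.log(y)               # the lambda = 0 member of the family
    return (y ** lam - 1.0) / lam

# Illustrative values: lambda = 0.5 (a square-root-type transform) and lambda = 0 (logs).
print(box_cox([1.0, 4.0, 9.0], 0.5))
print(box_cox([1.0, np.e], 0.0))
```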


²These points were first raised by White (1982) and are discussed further in Sections 5.8 and 14.4.




Li and McLeod (1988) and Janacek and Swift (1990) described approaches 
to maximum likelihood estimation for some non-Gaussian ARMA models. Martin 
(1981) discussed robust time series estimation for contaminated data. 


5.4. The Likelihood Function for a Gaussian
MA(1) Process


Conditional Likelihood Function 


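Calculation of the likelihood function for an autoregression turned out to be
much simpler if we conditioned on initial values for the Y's. Similarly, calculation
of the likelihood function for a moving average process is simpler if we condition
on initial values for the $\varepsilon$'s.

Consider the Gaussian MA(1) process

$$Y_t = \mu + \varepsilon_t + \theta\varepsilon_{t-1}, \qquad [5.4.1]$$

with $\varepsilon_t \sim \text{i.i.d. } N(0, \sigma^2)$. Let $\boldsymbol{\theta} = (\mu, \theta, \sigma^2)'$ denote the population parameters to
be estimated. If the value of $\varepsilon_{t-1}$ were known with certainty, then

$$Y_t \mid \varepsilon_{t-1} \sim N((\mu + \theta\varepsilon_{t-1}), \sigma^2)$$

or

$$f_{Y_t \mid \varepsilon_{t-1}}(y_t \mid \varepsilon_{t-1}; \boldsymbol{\theta}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[\frac{-(y_t - \mu - \theta\varepsilon_{t-1})^2}{2\sigma^2}\right]. \qquad [5.4.2]$$

Suppose that we knew for certain that $\varepsilon_0 = 0$. Then

$$(Y_1 \mid \varepsilon_0 = 0) \sim N(\mu, \sigma^2).$$

Moreover, given observation of $y_1$, the value of $\varepsilon_1$ is then known with certainty as
well:

$$\varepsilon_1 = y_1 - \mu,$$

allowing application of [5.4.2] again:

$$f_{Y_2 \mid Y_1, \varepsilon_0 = 0}(y_2 \mid y_1, \varepsilon_0 = 0; \boldsymbol{\theta}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[\frac{-(y_2 - \mu - \theta\varepsilon_1)^2}{2\sigma^2}\right].$$

Since $\varepsilon_1$ is known with certainty, $\varepsilon_2$ can be calculated from

$$\varepsilon_2 = y_2 - \mu - \theta\varepsilon_1.$$

Proceeding in this fashion, it is clear that given knowledge that $\varepsilon_0 = 0$, the full
sequence $\{\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_T\}$ can be calculated from $\{y_1, y_2, \ldots, y_T\}$ by iterating on

$$\varepsilon_t = y_t - \mu - \theta\varepsilon_{t-1} \qquad [5.4.3]$$

for t = 1, 2, ..., T, starting from $\varepsilon_0 = 0$. The conditional density of the tth
observation can then be calculated from [5.4.2] as

$$\begin{aligned}
f_{Y_t \mid Y_{t-1}, Y_{t-2}, \ldots, Y_1, \varepsilon_0 = 0}(y_t \mid y_{t-1}, y_{t-2}, \ldots, y_1, \varepsilon_0 = 0; \boldsymbol{\theta})
&= f_{Y_t \mid \varepsilon_{t-1}}(y_t \mid \varepsilon_{t-1}; \boldsymbol{\theta}) \\
&= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[\frac{-\varepsilon_t^2}{2\sigma^2}\right].
\end{aligned} \qquad [5.4.4]$$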



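The sample likelihood would then be the product of these individual densities:

$$\begin{aligned}
f_{Y_T, Y_{T-1}, \ldots, Y_1 \mid \varepsilon_0 = 0}&(y_T, y_{T-1}, \ldots, y_1 \mid \varepsilon_0 = 0; \boldsymbol{\theta}) \\
&= f_{Y_1 \mid \varepsilon_0 = 0}(y_1 \mid \varepsilon_0 = 0; \boldsymbol{\theta}) \prod_{t=2}^{T} f_{Y_t \mid Y_{t-1}, Y_{t-2}, \ldots, Y_1, \varepsilon_0 = 0}(y_t \mid y_{t-1}, y_{t-2}, \ldots, y_1, \varepsilon_0 = 0; \boldsymbol{\theta}).
\end{aligned}$$

The conditional log likelihood is

$$\begin{aligned}
\mathcal{L}(\boldsymbol{\theta}) &= \log f_{Y_T, Y_{T-1}, \ldots, Y_1 \mid \varepsilon_0 = 0}(y_T, y_{T-1}, \ldots, y_1 \mid \varepsilon_0 = 0; \boldsymbol{\theta}) \\
&= -\frac{T}{2}\log(2\pi) - \frac{T}{2}\log(\sigma^2) - \sum_{t=1}^{T} \frac{\varepsilon_t^2}{2\sigma^2}.
\end{aligned} \qquad [5.4.5]$$

For a particular numerical value of $\boldsymbol{\theta}$, we thus calculate the sequence of $\varepsilon$'s
implied by the data from [5.4.3]. The conditional log likelihood [5.4.5] is then a
function of the sum of squares of these $\varepsilon$'s. Although it is simple to program this
iteration by computer, the log likelihood is a fairly complicated nonlinear function
of $\mu$ and $\theta$, so that an analytical expression for the maximum likelihood estimates
of $\mu$ and $\theta$ is not readily calculated. Hence, even the conditional maximum likelihood
estimates for an MA(1) process must be found by numerical optimization.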
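
The iteration [5.4.3] and the resulting conditional log likelihood [5.4.5] can be programmed as follows. This is a minimal sketch assuming NumPy; the function name `ma1_conditional_loglik` and the sample values are hypothetical, and the function only evaluates the likelihood at given parameter values, leaving the numerical search to whatever optimizer one prefers.

```python
import numpy as np

def ma1_conditional_loglik(y, mu, theta, sigma2):
    """Conditional Gaussian MA(1) log likelihood [5.4.5], imposing eps_0 = 0."""
    eps_prev = 0.0                          # the conditioning assumption eps_0 = 0
    sum_sq = 0.0
    for y_t in y:
        eps_t = y_t - mu - theta * eps_prev   # iteration [5.4.3]
        sum_sq += eps_t ** 2
        eps_prev = eps_t
    T = len(y)
    return (-0.5 * T * np.log(2 * np.pi) - 0.5 * T * np.log(sigma2)
            - sum_sq / (2.0 * sigma2))

# Illustrative (assumed) data and parameter values.
print(ma1_conditional_loglik([1.0, 2.0, 0.5, -0.3], mu=0.2, theta=0.5, sigma2=1.5))
```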

Iteration on [5.4.3] from an arbitrary starting value of $\varepsilon_0$ will result in

$$\varepsilon_t = (y_t - \mu) - \theta(y_{t-1} - \mu) + \theta^2(y_{t-2} - \mu) - \cdots + (-1)^{t-1}\theta^{t-1}(y_1 - \mu) + (-1)^t\theta^t\varepsilon_0.$$

If $|\theta|$ is substantially less than unity, the effect of imposing $\varepsilon_0 = 0$ will quickly die
out and the conditional likelihood [5.4.4] will give a good approximation to the
unconditional likelihood for a reasonably large sample size. By contrast, if $|\theta| > 1$,
the consequences of imposing $\varepsilon_0 = 0$ accumulate over time. The conditional approach
is not reasonable in such a case. If numerical optimization of [5.4.5] results
in a value of $\theta$ that exceeds 1 in absolute value, the results must be discarded. The
numerical optimization should be attempted again with the reciprocal of $\hat{\theta}$ used as
a starting value for the numerical search procedure.


Exact Likelihood Function 


Two convenient algorithms are available for calculating the exact likelihood 
function for a Gaussian MA(1) process. One approach is to use the Kalman filter 
discussed in Chapter 13. A second approach uses the triangular factorization of 
the variance-covariance matrix. The second approach is described here: 

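As in Section 5.2, the observations on y can be collected in a (T × 1) vector
$\mathbf{y} \equiv (y_1, y_2, \ldots, y_T)'$ with mean $\boldsymbol{\mu} \equiv (\mu, \mu, \ldots, \mu)'$ and (T × T) variance-covariance
matrix

$$\boldsymbol{\Omega} = E(\mathbf{Y} - \boldsymbol{\mu})(\mathbf{Y} - \boldsymbol{\mu})'.$$

The variance-covariance matrix for T consecutive draws from an MA(1) process is

$$\boldsymbol{\Omega} = \sigma^2 \begin{bmatrix}
(1 + \theta^2) & \theta & 0 & \cdots & 0 \\
\theta & (1 + \theta^2) & \theta & \cdots & 0 \\
0 & \theta & (1 + \theta^2) & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
0 & 0 & 0 & \cdots & (1 + \theta^2)
\end{bmatrix}.$$

The likelihood function is then

$$f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta}) = (2\pi)^{-T/2} |\boldsymbol{\Omega}|^{-1/2} \exp[-\tfrac{1}{2}(\mathbf{y} - \boldsymbol{\mu})'\boldsymbol{\Omega}^{-1}(\mathbf{y} - \boldsymbol{\mu})]. \qquad [5.4.6]$$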




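A prediction-error decomposition of the likelihood is provided from the triangular
factorization of $\boldsymbol{\Omega}$,

$$\boldsymbol{\Omega} = \mathbf{A}\mathbf{D}\mathbf{A}', \qquad [5.4.7]$$

where A is the lower triangular matrix given in [4.5.18] and D is the diagonal matrix
in [4.5.19]. Substituting [5.4.7] into [5.4.6] gives

$$f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta}) = (2\pi)^{-T/2} |\mathbf{A}\mathbf{D}\mathbf{A}'|^{-1/2} \times \exp[-\tfrac{1}{2}(\mathbf{y} - \boldsymbol{\mu})'[\mathbf{A}']^{-1}\mathbf{D}^{-1}\mathbf{A}^{-1}(\mathbf{y} - \boldsymbol{\mu})]. \qquad [5.4.8]$$

But A is a lower triangular matrix with 1s along the principal diagonal. Hence,
$|\mathbf{A}| = 1$ and

$$|\mathbf{A}\mathbf{D}\mathbf{A}'| = |\mathbf{A}| \cdot |\mathbf{D}| \cdot |\mathbf{A}'| = |\mathbf{D}|.$$

Further defining

$$\tilde{\mathbf{y}} \equiv \mathbf{A}^{-1}(\mathbf{y} - \boldsymbol{\mu}), \qquad [5.4.9]$$

the likelihood [5.4.8] can be written

$$f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta}) = (2\pi)^{-T/2} |\mathbf{D}|^{-1/2} \exp[-\tfrac{1}{2}\tilde{\mathbf{y}}'\mathbf{D}^{-1}\tilde{\mathbf{y}}]. \qquad [5.4.10]$$

Notice that [5.4.9] implies

$$\mathbf{A}\tilde{\mathbf{y}} = \mathbf{y} - \boldsymbol{\mu}.$$

The first row of this system states that $\tilde{y}_1 = y_1 - \mu$, while the tth row implies that

$$\tilde{y}_t = y_t - \mu - \frac{\theta[1 + \theta^2 + \theta^4 + \cdots + \theta^{2(t-2)}]}{1 + \theta^2 + \theta^4 + \cdots + \theta^{2(t-1)}}\,\tilde{y}_{t-1}. \qquad [5.4.11]$$

The vector $\tilde{\mathbf{y}}$ can thus be calculated by iterating on [5.4.11] for t = 2, 3, ..., T
starting from $\tilde{y}_1 = y_1 - \mu$. The variable $\tilde{y}_t$ has the interpretation as the residual
from a linear projection of $y_t$ on a constant and $y_{t-1}, y_{t-2}, \ldots, y_1$, while the tth
diagonal element of D gives the MSE of this linear projection:

$$d_{tt} = E(\tilde{Y}_t^2) = \sigma^2\,\frac{1 + \theta^2 + \theta^4 + \cdots + \theta^{2t}}{1 + \theta^2 + \theta^4 + \cdots + \theta^{2(t-1)}}. \qquad [5.4.12]$$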

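Since $\mathbf{D}$ is diagonal, its determinant is the product of the terms along the principal diagonal,

$$|\mathbf{D}| = \prod_{t=1}^{T} d_{tt}, \tag*{[5.4.13]}$$

while the inverse of $\mathbf{D}$ is obtained by taking reciprocals of the terms along the principal diagonal. Hence,

$$\tilde{\mathbf{y}}'\mathbf{D}^{-1}\tilde{\mathbf{y}} = \sum_{t=1}^{T} \frac{\tilde{y}_t^2}{d_{tt}}. \tag*{[5.4.14]}$$

Substituting [5.4.13] and [5.4.14] into [5.4.10], the likelihood function is

$$f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta}) = (2\pi)^{-T/2}\left[\prod_{t=1}^{T} d_{tt}\right]^{-1/2} \exp\left[-\frac{1}{2}\sum_{t=1}^{T}\frac{\tilde{y}_t^2}{d_{tt}}\right]. \tag*{[5.4.15]}$$

The exact log likelihood for a Gaussian MA(1) process is therefore

$$\mathcal{L}(\boldsymbol{\theta}) = \log f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta}) = -\frac{T}{2}\log(2\pi) - \frac{1}{2}\sum_{t=1}^{T}\log(d_{tt}) - \frac{1}{2}\sum_{t=1}^{T}\frac{\tilde{y}_t^2}{d_{tt}}. \tag*{[5.4.16]}$$

Given numerical values for $\mu$, $\theta$, and $\sigma^2$, the sequence $\tilde{y}_t$ is calculated by iterating on [5.4.11] starting with $\tilde{y}_1 = y_1 - \mu$, while $d_{tt}$ is given by [5.4.12].

In contrast to the conditional log likelihood function [5.4.5], expression [5.4.16] will be valid regardless of whether $\theta$ is associated with an invertible MA(1) representation. The value of [5.4.16] at $\theta = \tilde{\theta}$, $\sigma^2 = \tilde{\sigma}^2$ will be identical to its value at $\theta = \tilde{\theta}^{-1}$, $\sigma^2 = \tilde{\theta}^2\tilde{\sigma}^2$; see Exercise 5.1.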
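
The recursion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `ma1_exact_loglik`, its argument order, and the use of NumPy are choices made here and are not part of the text. It iterates on [5.4.11], builds $d_{tt}$ from [5.4.12], and sums the terms of [5.4.16].

```python
import numpy as np

def ma1_exact_loglik(y, mu, theta, sigma2):
    """Exact Gaussian MA(1) log likelihood, following [5.4.11], [5.4.12], [5.4.16]."""
    y = np.asarray(y, dtype=float)
    T = y.size
    # partial[k] = 1 + theta^2 + theta^4 + ... + theta^(2k), for k = 0, ..., T
    partial = np.cumsum(float(theta) ** (2 * np.arange(T + 1)))
    d = sigma2 * partial[1:] / partial[:-1]          # d_tt for t = 1, ..., T  [5.4.12]
    y_tilde = np.empty(T)
    y_tilde[0] = y[0] - mu                           # first row of A y_tilde = y - mu
    for i in range(1, T):                            # array index i holds date t = i + 1
        coef = theta * partial[i - 1] / partial[i]   # ratio multiplying y_tilde_{t-1} in [5.4.11]
        y_tilde[i] = y[i] - mu - coef * y_tilde[i - 1]
    return (-0.5 * T * np.log(2 * np.pi)
            - 0.5 * np.sum(np.log(d))
            - 0.5 * np.sum(y_tilde ** 2 / d))
```

Evaluating the function at $(\theta, \sigma^2) = (0.5, 1.5)$ and at $(2, 0.375)$ should give the same number, which is the invariance noted above.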




5.5. The Likelihood Function for a Gaussian 
MA(q) Process 
Conditional Likelihood Function 
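For the MA($q$) process,

$$Y_t = \mu + \varepsilon_t + \theta_1\varepsilon_{t-1} + \theta_2\varepsilon_{t-2} + \cdots + \theta_q\varepsilon_{t-q}, \tag*{[5.5.1]}$$

a simple approach is to condition on the assumption that the first $q$ values for $\varepsilon$ were all zero:

$$\varepsilon_0 = \varepsilon_{-1} = \cdots = \varepsilon_{-q+1} = 0. \tag*{[5.5.2]}$$

From these starting values we can iterate on

$$\varepsilon_t = y_t - \mu - \theta_1\varepsilon_{t-1} - \theta_2\varepsilon_{t-2} - \cdots - \theta_q\varepsilon_{t-q} \tag*{[5.5.3]}$$

for $t = 1, 2, \ldots, T$. Let $\boldsymbol{\varepsilon}_0$ denote the $(q \times 1)$ vector $(\varepsilon_0, \varepsilon_{-1}, \ldots, \varepsilon_{-q+1})'$. The conditional log likelihood is then

$$\mathcal{L}(\boldsymbol{\theta}) = \log f_{Y_T, Y_{T-1}, \ldots, Y_1 | \boldsymbol{\varepsilon}_0 = \mathbf{0}}(y_T, y_{T-1}, \ldots, y_1 | \boldsymbol{\varepsilon}_0 = \mathbf{0}; \boldsymbol{\theta}) = -\frac{T}{2}\log(2\pi) - \frac{T}{2}\log(\sigma^2) - \sum_{t=1}^{T}\frac{\varepsilon_t^2}{2\sigma^2}, \tag*{[5.5.4]}$$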


where $\boldsymbol{\theta} = (\mu, \theta_1, \theta_2, \ldots, \theta_q, \sigma^2)'$. Again, expression [5.5.4] is useful only if all values of $z$ for which

$$1 + \theta_1 z + \theta_2 z^2 + \cdots + \theta_q z^q = 0$$

lie outside the unit circle.
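
A minimal Python sketch of this conditional calculation follows. The names `maq_conditional_loglik` and `is_invertible` are hypothetical helpers introduced here for illustration; the first iterates on [5.5.3] from the zero starting values [5.5.2] and evaluates [5.5.4], and the second checks the root condition just stated using NumPy's polynomial root finder.

```python
import numpy as np

def maq_conditional_loglik(y, mu, thetas, sigma2):
    """Conditional Gaussian MA(q) log likelihood [5.5.4], presample errors set to zero."""
    y = np.asarray(y, dtype=float)
    thetas = np.asarray(thetas, dtype=float)
    T, q = y.size, thetas.size
    eps = np.zeros(T + q)                  # first q slots are the zeros of [5.5.2]
    for t in range(T):
        lagged = eps[t:t + q][::-1]        # eps_{t-1}, eps_{t-2}, ..., eps_{t-q}
        eps[t + q] = y[t] - mu - thetas @ lagged   # recursion [5.5.3]
    e = eps[q:]
    return (-0.5 * T * np.log(2 * np.pi) - 0.5 * T * np.log(sigma2)
            - np.sum(e ** 2) / (2 * sigma2))

def is_invertible(thetas):
    """True if all roots of 1 + theta_1 z + ... + theta_q z^q lie outside the unit circle."""
    thetas = np.asarray(thetas, dtype=float)
    roots = np.roots(np.r_[thetas[::-1], 1.0])     # np.roots wants the highest power first
    return bool(np.all(np.abs(roots) > 1.0))
```

In practice one would call `is_invertible` before trusting the value returned by `maq_conditional_loglik`, since the effect of the zero starting values dies out only in the invertible case.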


Exact Likelihood Function 
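The exact likelihood function is given by

$$f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta}) = (2\pi)^{-T/2}\,|\boldsymbol{\Omega}|^{-1/2} \exp\left[-\tfrac{1}{2}(\mathbf{y} - \boldsymbol{\mu})'\boldsymbol{\Omega}^{-1}(\mathbf{y} - \boldsymbol{\mu})\right], \tag*{[5.5.5]}$$

where as before $\mathbf{y} \equiv (y_1, y_2, \ldots, y_T)'$ and $\boldsymbol{\mu} \equiv (\mu, \mu, \ldots, \mu)'$. Here $\boldsymbol{\Omega}$ represents the variance-covariance matrix of $T$ consecutive draws from an MA($q$) process:

$$\boldsymbol{\Omega} = \begin{bmatrix} \gamma_0 & \gamma_1 & \cdots & \gamma_q & 0 & \cdots & 0 \\ \gamma_1 & \gamma_0 & \cdots & \gamma_{q-1} & \gamma_q & \cdots & 0 \\ \vdots & \vdots & & \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 0 & 0 & \cdots & \gamma_0 \end{bmatrix}. \tag*{[5.5.6]}$$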




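The row $i$, column $j$ element of $\boldsymbol{\Omega}$ is given by $\gamma_{|i-j|}$, where $\gamma_k$ is the $k$th autocovariance of an MA($q$) process:

$$\gamma_k = \begin{cases} \sigma^2(\theta_k + \theta_{k+1}\theta_1 + \theta_{k+2}\theta_2 + \cdots + \theta_q\theta_{q-k}) & \text{for } k = 0, 1, \ldots, q \\ 0 & \text{for } k > q, \end{cases} \tag*{[5.5.7]}$$

where $\theta_0 \equiv 1$. Again, the exact likelihood function [5.5.5] can be evaluated using either the Kalman filter of Chapter 13 or the triangular factorization of $\boldsymbol{\Omega}$,

$$\boldsymbol{\Omega} = \mathbf{A}\mathbf{D}\mathbf{A}', \tag*{[5.5.8]}$$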


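where $\mathbf{A}$ is the lower triangular matrix given by [4.4.11] and $\mathbf{D}$ is the diagonal matrix given by [4.4.7]. Note that the band structure of $\boldsymbol{\Omega}$ in [5.5.6] makes $\mathbf{A}$ and $\mathbf{D}$ simple to calculate. After the first $(q + 1)$ rows, all the subsequent entries in the first column of $\boldsymbol{\Omega}$ are already zero, so no multiple of the first row need be added to make these zero. Hence, $a_{i1} = 0$ for $i > q + 1$. Similarly, beyond the first $(q + 2)$ rows of the second column, no multiple of the second row need be added to make these entries zero, meaning that $a_{i2} = 0$ for $i > q + 2$. Thus $\mathbf{A}$ is a lower triangular band matrix with $a_{ij} = 0$ for $i > q + j$:

$$\mathbf{A} = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 & 0 \\ a_{21} & 1 & 0 & \cdots & 0 & 0 \\ a_{31} & a_{32} & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & & \vdots & \vdots \\ a_{q+1,1} & a_{q+1,2} & a_{q+1,3} & \cdots & 0 & 0 \\ 0 & a_{q+2,2} & a_{q+2,3} & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & a_{T,T-1} & 1 \end{bmatrix}.$$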


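A computer can be programmed to calculate these matrices quickly for a given numerical value for $\boldsymbol{\theta}$.

Substituting [5.5.8] into [5.5.5], the exact likelihood function for a Gaussian MA($q$) process can be written as in [5.4.10]:

$$f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta}) = (2\pi)^{-T/2}\,|\mathbf{D}|^{-1/2} \exp\left[-\tfrac{1}{2}\tilde{\mathbf{y}}'\mathbf{D}^{-1}\tilde{\mathbf{y}}\right],$$

where

$$\mathbf{A}\tilde{\mathbf{y}} = \mathbf{y} - \boldsymbol{\mu}. \tag*{[5.5.9]}$$

The elements of $\tilde{\mathbf{y}}$ can be calculated recursively by working down the rows of [5.5.9]:

$$\begin{aligned} \tilde{y}_1 &= y_1 - \mu \\ \tilde{y}_2 &= (y_2 - \mu) - a_{21}\tilde{y}_1 \\ \tilde{y}_3 &= (y_3 - \mu) - a_{32}\tilde{y}_2 - a_{31}\tilde{y}_1 \\ &\;\;\vdots \\ \tilde{y}_t &= (y_t - \mu) - a_{t,t-1}\tilde{y}_{t-1} - a_{t,t-2}\tilde{y}_{t-2} - \cdots - a_{t,t-q}\tilde{y}_{t-q}. \end{aligned}$$

The exact log likelihood function can then be calculated as in [5.4.16]:

$$\mathcal{L}(\boldsymbol{\theta}) = \log f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta}) = -\frac{T}{2}\log(2\pi) - \frac{1}{2}\sum_{t=1}^{T}\log(d_{tt}) - \frac{1}{2}\sum_{t=1}^{T}\frac{\tilde{y}_t^2}{d_{tt}}. \tag*{[5.5.10]}$$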
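
One way such a program might look is sketched below in Python. The function names `maq_autocov` and `maq_exact_loglik` are assumptions of this illustration, and the factorization loop is a plain textbook $\mathbf{A}\mathbf{D}\mathbf{A}'$ recursion that exploits the band structure; it is not claimed to be the algorithm behind [4.4.7] and [4.4.11], only one that yields the same factors.

```python
import numpy as np

def maq_autocov(thetas, sigma2):
    """Autocovariances gamma_0, ..., gamma_q of an MA(q) process, [5.5.7] with theta_0 = 1."""
    psi = np.r_[1.0, np.asarray(thetas, dtype=float)]
    q = psi.size - 1
    return sigma2 * np.array([psi[k:] @ psi[:q + 1 - k] for k in range(q + 1)])

def maq_exact_loglik(y, mu, thetas, sigma2):
    """Exact Gaussian MA(q) log likelihood [5.5.10] via Omega = A D A'."""
    y = np.asarray(y, dtype=float)
    T = y.size
    gamma = maq_autocov(thetas, sigma2)
    q = gamma.size - 1
    A = np.eye(T)                       # unit lower triangular band matrix
    d = np.empty(T)                     # diagonal of D
    y_tilde = np.empty(T)
    for i in range(T):
        lo = max(0, i - q)              # a_ij = 0 whenever i > q + j
        for j in range(lo, i):
            # omega_ij less the contribution of earlier columns inside the band
            resid = gamma[i - j] - np.sum(A[i, lo:j] * A[j, lo:j] * d[lo:j])
            A[i, j] = resid / d[j]
        d[i] = gamma[0] - np.sum(A[i, lo:i] ** 2 * d[lo:i])
        # row i of A y_tilde = y - mu, solved by forward substitution [5.5.9]
        y_tilde[i] = y[i] - mu - A[i, lo:i] @ y_tilde[lo:i]
    return (-0.5 * T * np.log(2 * np.pi)
            - 0.5 * np.sum(np.log(d))
            - 0.5 * np.sum(y_tilde ** 2 / d))
```

Because each row touches at most $q$ earlier columns, the work grows roughly in proportion to $Tq^2$ rather than $T^3$; storing the full $T \times T$ array for $\mathbf{A}$ is a simplification made here for readability.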




5.6. The Likelihood Function for a Gaussian 
ARMA(p, q) Process 


Conditional Likelihood Function 
A Gaussian ARMA($p$, $q$) process takes the form

$$Y_t = c + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \cdots + \phi_p Y_{t-p} + \varepsilon_t + \theta_1\varepsilon_{t-1} + \theta_2\varepsilon_{t-2} + \cdots + \theta_q\varepsilon_{t-q}, \tag*{[5.6.1]}$$

where $\varepsilon_t \sim$ i.i.d. $N(0, \sigma^2)$. The goal is to estimate the vector of population parameters $\boldsymbol{\theta} = (c, \phi_1, \phi_2, \ldots, \phi_p, \theta_1, \theta_2, \ldots, \theta_q, \sigma^2)'$.

The approximation to the likelihood function for an autoregression conditioned on initial values of the $y$'s, and the approximation for a moving average process conditioned on initial values of the $\varepsilon$'s. A common approximation to the likelihood function for an ARMA($p$, $q$) process conditions on both $y$'s and $\varepsilon$'s.


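Taking initial values for $\mathbf{y}_0 \equiv (y_0, y_{-1}, \ldots, y_{-p+1})'$ and $\boldsymbol{\varepsilon}_0 \equiv (\varepsilon_0, \varepsilon_{-1}, \ldots, \varepsilon_{-q+1})'$ as given, the sequence $\{\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_T\}$ can be calculated from $\{y_1, y_2, \ldots, y_T\}$ by iterating on

$$\varepsilon_t = y_t - c - \phi_1 y_{t-1} - \phi_2 y_{t-2} - \cdots - \phi_p y_{t-p} - \theta_1\varepsilon_{t-1} - \theta_2\varepsilon_{t-2} - \cdots - \theta_q\varepsilon_{t-q} \tag*{[5.6.2]}$$

for $t = 1, 2, \ldots, T$. The conditional log likelihood is then

$$\mathcal{L}(\boldsymbol{\theta}) = \log f_{Y_T, Y_{T-1}, \ldots, Y_1 | \mathbf{Y}_0, \boldsymbol{\varepsilon}_0}(y_T, y_{T-1}, \ldots, y_1 | \mathbf{y}_0, \boldsymbol{\varepsilon}_0; \boldsymbol{\theta}) = -\frac{T}{2}\log(2\pi) - \frac{T}{2}\log(\sigma^2) - \sum_{t=1}^{T}\frac{\varepsilon_t^2}{2\sigma^2}. \tag*{[5.6.3]}$$

One option is to set initial $y$'s and $\varepsilon$'s equal to their expected values. That is, set $y_s = c/(1 - \phi_1 - \phi_2 - \cdots - \phi_p)$ for $s = 0, -1, \ldots, -p + 1$ and set $\varepsilon_s = 0$ for $s = 0, -1, \ldots, -q + 1$, and then proceed with the iteration in [5.6.2] for $t = 1, 2, \ldots, T$. Alternatively, Box and Jenkins (1976, p. 211) recommended setting $\varepsilon$'s to zero but $y$'s equal to their actual values. Thus, iteration on [5.6.2] is started at date $t = p + 1$ with $y_1, y_2, \ldots, y_p$ set to the observed values and

$$\varepsilon_p = \varepsilon_{p-1} = \cdots = \varepsilon_{p-q+1} = 0.$$

Then the conditional likelihood calculated is

$$\log f(y_T, \ldots, y_{p+1} | y_p, \ldots, y_1, \varepsilon_p = 0, \ldots, \varepsilon_{p-q+1} = 0) = -\frac{T-p}{2}\log(2\pi) - \frac{T-p}{2}\log(\sigma^2) - \sum_{t=p+1}^{T}\frac{\varepsilon_t^2}{2\sigma^2}.$$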


As in the case for the moving average processes, these approximations should be used only if all values of $z$ satisfying

$$1 + \theta_1 z + \theta_2 z^2 + \cdots + \theta_q z^q = 0$$

lie outside the unit circle.
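
The Box and Jenkins variant can be sketched as follows. The function name `arma_conditional_loglik` and its signature are hypothetical, chosen for this illustration; the code starts the recursion [5.6.2] at date $p + 1$, treats the first $p$ observations as given, and sets the required lagged $\varepsilon$'s to zero.

```python
import numpy as np

def arma_conditional_loglik(y, c, phis, thetas, sigma2):
    """Conditional ARMA(p, q) log likelihood with y_1..y_p taken as given and
    eps_p = ... = eps_{p-q+1} = 0 (the Box-Jenkins starting values)."""
    y = np.asarray(y, dtype=float)
    phis = np.asarray(phis, dtype=float)
    thetas = np.asarray(thetas, dtype=float)
    T, p, q = y.size, phis.size, thetas.size
    eps = np.zeros(T)                        # entries for dates 1..p stay at zero
    for t in range(p, T):                    # array index t holds date t + 1
        y_lags = y[t - p:t][::-1]            # y_{t-1}, ..., y_{t-p}
        e_lags = np.array([eps[t - j] if t - j >= 0 else 0.0
                           for j in range(1, q + 1)])   # eps_{t-1}, ..., eps_{t-q}
        eps[t] = y[t] - c - phis @ y_lags - thetas @ e_lags   # recursion [5.6.2]
    n = T - p                                # number of observations actually modeled
    return (-0.5 * n * np.log(2 * np.pi) - 0.5 * n * np.log(sigma2)
            - np.sum(eps[p:] ** 2) / (2 * sigma2))
```

For example, `arma_conditional_loglik([1.0, 2.0, 1.0], 0.0, [0.5], [0.5], 1.0)` conditions on the first observation and models the remaining two.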


Alternative Algorithms 


The simplest approach to calculating the exact likelihood function for a Gaussian ARMA process is to use the Kalman filter described in Chapter 13. For more details on exact and approximate maximum likelihood estimation of ARMA models, see Galbraith and Galbraith (1974), Box and Jenkins (1976, Chapter 6), Hannan and Rissanen (1982), and Koreisha and Pukkila (1989).


5.7. Numerical Optimization 


Previous sections of this chapter have shown how to calculate the log likelihood function

$$\mathcal{L}(\boldsymbol{\theta}) = \log f_{Y_T, Y_{T-1}, \ldots, Y_1}(y_T, y_{T-1}, \ldots, y_1; \boldsymbol{\theta}) \tag*{[5.7.1]}$$

for various specifications of the process thought to have generated the observed data $y_1, y_2, \ldots, y_T$. Given the observed data, the formulas given could be used to calculate the value of $\mathcal{L}(\boldsymbol{\theta})$ for any given numerical value of $\boldsymbol{\theta}$.

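This section discusses how to find the value of $\boldsymbol{\theta}$ that maximizes $\mathcal{L}(\boldsymbol{\theta})$ given no more knowledge than this ability to calculate the value of $\mathcal{L}(\boldsymbol{\theta})$ for any particular value of $\boldsymbol{\theta}$. The general approach is to write a procedure that enables a computer to calculate the numerical value of $\mathcal{L}(\boldsymbol{\theta})$ for any particular numerical values for $\boldsymbol{\theta}$ and the observed data $y_1, y_2, \ldots, y_T$. We can think of this procedure as a "black box" that enables us to guess some value of $\boldsymbol{\theta}$ and see what the resulting value of $\mathcal{L}(\boldsymbol{\theta})$ would be:

Input: values of $y_1, y_2, \ldots, y_T$ and $\boldsymbol{\theta}$
Procedure: calculates $\mathcal{L}(\boldsymbol{\theta})$
Output: value of $\mathcal{L}(\boldsymbol{\theta})$

The idea will be to make a series of different guesses for $\boldsymbol{\theta}$, compare the value of $\mathcal{L}(\boldsymbol{\theta})$ for each guess, and try to infer from these values for $\mathcal{L}(\boldsymbol{\theta})$ the value $\hat{\boldsymbol{\theta}}$ for which $\mathcal{L}(\boldsymbol{\theta})$ is largest. Such methods are described as numerical maximization.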


Grid Search 


The simplest approach to numerical maximization is known as the grid search method. To illustrate this approach, suppose we have data generated by an AR(1) process, for which the log likelihood was seen to be given by [5.2.9]. To keep the example very simple, it is assumed to be known that the mean of the process is zero ($c = 0$) and that the innovations have unit variance ($\sigma^2 = 1$). Thus the only unknown parameter is the autoregressive coefficient $\phi$, and [5.2.9] simplifies to

$$\mathcal{L}(\phi) = -\frac{T}{2}\log(2\pi) + \frac{1}{2}\log(1 - \phi^2) - \frac{1}{2}(1 - \phi^2)y_1^2 - \frac{1}{2}\sum_{t=2}^{T}(y_t - \phi y_{t-1})^2. \tag*{[5.7.2]}$$

Suppose that the observed sample consists of the following $T = 5$ observations:

$$y_1 = 0.8 \qquad y_2 = 0.2 \qquad y_3 = -1.2 \qquad y_4 = -0.4 \qquad y_5 = 0.0.$$

If we make an arbitrary guess as to the value of $\phi$, say, $\phi = 0.0$, and plug this guess into expression [5.7.2], we calculate that $\mathcal{L}(\phi) = -5.73$ at $\phi = 0.0$. Trying another guess ($\phi = 0.1$), we calculate $\mathcal{L}(\phi) = -5.71$ at $\phi = 0.1$; the log likelihood is higher at $\phi = 0.1$ than at $\phi = 0.0$. Continuing in this fashion, we could calculate the value of $\mathcal{L}(\phi)$ for every value of $\phi$ between $-0.9$ and $+0.9$ in increments of 0.1. The results are reported in Figure 5.1. It appears from these calculations that the log likelihood function $\mathcal{L}(\phi)$ is nicely behaved with a unique maximum at some value of $\phi$ between 0.1 and 0.3. We could then focus on this subregion of the parameter space and evaluate $\mathcal{L}(\phi)$ at a finer grid, calculating the value of $\mathcal{L}(\phi)$ for all values of $\phi$ between 0.1 and 0.3 in increments of 0.02. Proceeding in this fashion, it should be possible to get arbitrarily close to the value of $\phi$ that maximizes $\mathcal{L}(\phi)$ by making the grid finer and finer.

Note that this procedure does not find the exact MLE $\hat{\phi}$, but instead approximates it with any accuracy desired. In general, this will be the case with any numerical maximization algorithm. To use these algorithms we therefore have to specify a convergence criterion, or some way of deciding when we are close enough to the true maximum. For example, suppose we want an estimate $\hat{\phi}$ that differs from the true MLE by no more than $\pm 0.0001$. Then we would continue refining the grid until the increments are in steps of 0.0001, and the best estimate among the elements of that grid would be the numerical MLE of $\phi$.
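
The grid search for this AR(1) example can be sketched in Python as below. The names `ar1_loglik` and `grid_search`, and the particular refinement rule (re-gridding one coarse step either side of the current best point), are assumptions of this illustration rather than a prescription from the text, and the rule presumes a unimodal function.

```python
import numpy as np

Y = np.array([0.8, 0.2, -1.2, -0.4, 0.0])    # the T = 5 observations of the example

def ar1_loglik(phi, y=Y):
    """Log likelihood [5.7.2] for a zero-mean, unit-variance Gaussian AR(1)."""
    T = y.size
    return (-0.5 * T * np.log(2 * np.pi)
            + 0.5 * np.log(1 - phi ** 2)
            - 0.5 * (1 - phi ** 2) * y[0] ** 2
            - 0.5 * np.sum((y[1:] - phi * y[:-1]) ** 2))

def grid_search(f, lo, hi, tol=1e-4, n=19):
    """Evaluate f on n evenly spaced points, then zoom in around the best point
    until the grid spacing is no larger than tol."""
    while True:
        grid = np.linspace(lo, hi, n)
        step = grid[1] - grid[0]
        best = grid[np.argmax([f(x) for x in grid])]
        if step <= tol:
            return best
        lo, hi = max(lo, best - step), min(hi, best + step)

phi_hat = grid_search(ar1_loglik, -0.9, 0.9)   # first pass uses increments of 0.1
```

With 19 points on $[-0.9, 0.9]$ the first pass reproduces the increments of 0.1 described above, and `ar1_loglik(0.0)` and `ar1_loglik(0.1)` round to the values $-5.73$ and $-5.71$ quoted in the text.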

For the simple AR(1) example in Figure 5.1, the log likelihood function is unimodal: there is a unique value $\phi$ for which $\partial\mathcal{L}(\phi)/\partial\phi = 0$. For a general numerical maximization problem, this need not be the case. For example, suppose that we are interested in estimating a scalar parameter $\theta$ for which the log likelihood function is as displayed in Figure 5.2. The value $\theta = -0.6$ is a local maximum, meaning that the likelihood function is higher there than for any other $\theta$ in a neighborhood around $\theta = -0.6$. However, the global maximum occurs around $\theta = 0.2$. The grid search method should work well for a unimodal likelihood as long as $\mathcal{L}(\theta)$ is continuous. When there are multiple local maxima, the grid must be sufficiently fine to reveal all of the local "hills" on the likelihood surface.


Steepest Ascent 


Grid search can be a very good method when there is a single unknown parameter to estimate. However, it quickly becomes intractable when the number of elements of $\boldsymbol{\theta}$ becomes large. An alternative numerical method that often succeeds in maximizing a continuously differentiable function of a large number of parameters is known as steepest ascent.

FIGURE 5.1 Log likelihood for an AR(1) process for various guesses of $\phi$.

FIGURE 5.2 Bimodal log likelihood function.

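To understand this approach, let us temporarily disregard the "black box" nature of the investigation and instead examine how we would proceed analytically with a particular maximization problem. Suppose we have an initial estimate of the parameter vector, denoted $\boldsymbol{\theta}^{(0)}$, and wish to come up with a better estimate $\boldsymbol{\theta}^{(1)}$. Imagine that we are constrained to choose $\boldsymbol{\theta}^{(1)}$ so that the squared distance between $\boldsymbol{\theta}^{(0)}$ and $\boldsymbol{\theta}^{(1)}$ is some fixed number $k$:

$$\{\boldsymbol{\theta}^{(1)} - \boldsymbol{\theta}^{(0)}\}'\{\boldsymbol{\theta}^{(1)} - \boldsymbol{\theta}^{(0)}\} = k.$$

The optimal value to choose for $\boldsymbol{\theta}^{(1)}$ would then be the solution to the following constrained maximization problem:

$$\max_{\boldsymbol{\theta}^{(1)}} \mathcal{L}(\boldsymbol{\theta}^{(1)}) \quad \text{subject to} \quad \{\boldsymbol{\theta}^{(1)} - \boldsymbol{\theta}^{(0)}\}'\{\boldsymbol{\theta}^{(1)} - \boldsymbol{\theta}^{(0)}\} = k.$$

To characterize the solution to this problem,$^3$ form the Lagrangean,

$$J(\boldsymbol{\theta}^{(1)}) = \mathcal{L}(\boldsymbol{\theta}^{(1)}) + \lambda\left[k - \{\boldsymbol{\theta}^{(1)} - \boldsymbol{\theta}^{(0)}\}'\{\boldsymbol{\theta}^{(1)} - \boldsymbol{\theta}^{(0)}\}\right], \tag*{[5.7.3]}$$

where $\lambda$ denotes a Lagrange multiplier. Differentiating [5.7.3] with respect to $\boldsymbol{\theta}^{(1)}$ and setting the result equal to zero yields

$$\left.\frac{\partial\mathcal{L}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right|_{\boldsymbol{\theta}=\boldsymbol{\theta}^{(1)}} - (2\lambda)\{\boldsymbol{\theta}^{(1)} - \boldsymbol{\theta}^{(0)}\} = \mathbf{0}. \tag*{[5.7.4]}$$

Let $\mathbf{g}(\boldsymbol{\theta})$ denote the gradient vector of the log likelihood function:

$$\mathbf{g}(\boldsymbol{\theta}) \equiv \frac{\partial\mathcal{L}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}.$$

If there are $a$ elements of $\boldsymbol{\theta}$, then $\mathbf{g}(\boldsymbol{\theta})$ is an $(a \times 1)$ vector whose $i$th element represents the derivative of the log likelihood with respect to the $i$th element of $\boldsymbol{\theta}$.

$^3$See Chiang (1974) for an introduction to the use of Lagrange multipliers for solving a constrained optimization problem.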




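Using this notation, expression [5.7.4] can be written as

$$\boldsymbol{\theta}^{(1)} - \boldsymbol{\theta}^{(0)} = [1/(2\lambda)] \cdot \mathbf{g}(\boldsymbol{\theta}^{(1)}). \tag*{[5.7.5]}$$

Expression [5.7.5] asserts that if we are allowed to change $\boldsymbol{\theta}$ by only a fixed amount, the biggest increase in the log likelihood function will be achieved if the change in $\boldsymbol{\theta}$ (the magnitude $\boldsymbol{\theta}^{(1)} - \boldsymbol{\theta}^{(0)}$) is chosen to be a constant $1/(2\lambda)$ times the gradient vector $\mathbf{g}(\boldsymbol{\theta}^{(1)})$. If we are contemplating a very small step (so that $k$ is near zero), the value $\mathbf{g}(\boldsymbol{\theta}^{(1)})$ will approach $\mathbf{g}(\boldsymbol{\theta}^{(0)})$. In other words, the gradient vector $\mathbf{g}(\boldsymbol{\theta}^{(0)})$ gives the direction in which the log likelihood function increases most steeply from $\boldsymbol{\theta}^{(0)}$.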

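For illustration, suppose that $a = 2$ and let the log likelihood be

$$\mathcal{L}(\boldsymbol{\theta}) = -1.5\theta_1^2 - 2\theta_2^2. \tag*{[5.7.6]}$$

We can easily see analytically for this example that the MLE is given by $\hat{\boldsymbol{\theta}} = (0, 0)'$. Let us nevertheless use this example to illustrate how the method of steepest ascent works. The elements of the gradient vector are

$$\frac{\partial\mathcal{L}(\boldsymbol{\theta})}{\partial\theta_1} = -3\theta_1, \qquad \frac{\partial\mathcal{L}(\boldsymbol{\theta})}{\partial\theta_2} = -4\theta_2. \tag*{[5.7.7]}$$

Suppose that the initial guess is $\boldsymbol{\theta}^{(0)} = (-1, 1)'$. Then

$$\left.\frac{\partial\mathcal{L}(\boldsymbol{\theta})}{\partial\theta_1}\right|_{\boldsymbol{\theta}=\boldsymbol{\theta}^{(0)}} = 3, \qquad \left.\frac{\partial\mathcal{L}(\boldsymbol{\theta})}{\partial\theta_2}\right|_{\boldsymbol{\theta}=\boldsymbol{\theta}^{(0)}} = -4.$$

An increase in $\theta_1$ would increase the likelihood, while an increase in $\theta_2$ would decrease the likelihood. The gradient vector evaluated at $\boldsymbol{\theta}^{(0)}$ is

$$\mathbf{g}(\boldsymbol{\theta}^{(0)}) = \begin{bmatrix} 3 \\ -4 \end{bmatrix},$$

so that the optimal step $\boldsymbol{\theta}^{(1)} - \boldsymbol{\theta}^{(0)}$ should be proportional to $(3, -4)'$. For example, with $k = 1$ we would choose

$$\theta_1^{(1)} - \theta_1^{(0)} = \tfrac{3}{5}, \qquad \theta_2^{(1)} - \theta_2^{(0)} = -\tfrac{4}{5};$$

that is, the new guesses would be $\theta_1^{(1)} = -0.4$ and $\theta_2^{(1)} = 0.2$. To increase the likelihood by the greatest amount, we want to increase $\theta_1$ and decrease $\theta_2$ relative to their values at the initial guess $\boldsymbol{\theta}^{(0)}$. Since a one-unit change in $\theta_2$ has a bigger effect on $\mathcal{L}(\boldsymbol{\theta})$ than would a one-unit change in $\theta_1$, the change in $\theta_2$ is larger in absolute value than the change in $\theta_1$.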

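Let us now return to the black box perspective, where the only capability we have is to calculate the value of $\mathcal{L}(\boldsymbol{\theta})$ for a specified numerical value of $\boldsymbol{\theta}$. We might start with an arbitrary initial guess for the value of $\boldsymbol{\theta}$, denoted $\boldsymbol{\theta}^{(0)}$. Suppose we then calculate the value of the gradient vector at $\boldsymbol{\theta}^{(0)}$:

$$\mathbf{g}(\boldsymbol{\theta}^{(0)}) = \left.\frac{\partial\mathcal{L}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right|_{\boldsymbol{\theta}=\boldsymbol{\theta}^{(0)}}. \tag*{[5.7.8]}$$

This gradient could in principle be calculated analytically, by differentiating the general expression for $\mathcal{L}(\boldsymbol{\theta})$ with respect to $\boldsymbol{\theta}$ and writing a computer procedure to calculate each element of $\mathbf{g}(\boldsymbol{\theta})$ given the data and a numerical value for $\boldsymbol{\theta}$. For example, expression [5.7.7] could be used to calculate $\mathbf{g}(\boldsymbol{\theta})$ for any particular value of $\boldsymbol{\theta}$. Alternatively, if it is too hard to differentiate $\mathcal{L}(\boldsymbol{\theta})$ analytically, we can always get a numerical approximation to the gradient by seeing how $\mathcal{L}(\boldsymbol{\theta})$ changes for a small change in each element of $\boldsymbol{\theta}$. In particular, the $i$th element of $\mathbf{g}(\boldsymbol{\theta}^{(0)})$ might be approximated by

$$g_i(\boldsymbol{\theta}^{(0)}) \cong \frac{1}{\Delta}\left\{\mathcal{L}(\theta_1^{(0)}, \theta_2^{(0)}, \ldots, \theta_{i-1}^{(0)}, \theta_i^{(0)} + \Delta, \theta_{i+1}^{(0)}, \theta_{i+2}^{(0)}, \ldots, \theta_a^{(0)}) - \mathcal{L}(\theta_1^{(0)}, \theta_2^{(0)}, \ldots, \theta_{i-1}^{(0)}, \theta_i^{(0)}, \theta_{i+1}^{(0)}, \theta_{i+2}^{(0)}, \ldots, \theta_a^{(0)})\right\}, \tag*{[5.7.9]}$$

where $\Delta$ represents some arbitrarily chosen small scalar such as $\Delta = 10^{-6}$. By numerically calculating the value of $\mathcal{L}(\boldsymbol{\theta})$ at $\boldsymbol{\theta}^{(0)}$ and at $a$ different values of $\boldsymbol{\theta}$ corresponding to small changes in each of the individual elements of $\boldsymbol{\theta}^{(0)}$, an estimate of the full vector $\mathbf{g}(\boldsymbol{\theta}^{(0)})$ can be uncovered.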
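
The forward-difference approximation [5.7.9] might be coded as in the following sketch. The function names `numerical_gradient` and `quad_loglik` are assumptions made for illustration; `quad_loglik` is simply the example log likelihood [5.7.6], used here because its gradient is known analytically from [5.7.7].

```python
import numpy as np

def numerical_gradient(loglik, theta, delta=1e-6):
    """Forward-difference approximation to the gradient, one element at a time [5.7.9]."""
    theta = np.asarray(theta, dtype=float)
    base = loglik(theta)                       # value at theta itself
    grad = np.empty(theta.size)
    for i in range(theta.size):
        bumped = theta.copy()
        bumped[i] += delta                     # small change in the ith element only
        grad[i] = (loglik(bumped) - base) / delta
    return grad

def quad_loglik(theta):
    """Example log likelihood [5.7.6]."""
    return -1.5 * theta[0] ** 2 - 2.0 * theta[1] ** 2

g0 = numerical_gradient(quad_loglik, [-1.0, 1.0])   # close to (3, -4)
```

The call requires $a + 1$ evaluations of the log likelihood, one at $\boldsymbol{\theta}^{(0)}$ and one for each perturbed element, as described above.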

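Result [5.7.5] suggests that we should change the value of $\boldsymbol{\theta}$ in the direction of the gradient, choosing

$$\boldsymbol{\theta}^{(1)} - \boldsymbol{\theta}^{(0)} = s \cdot \mathbf{g}(\boldsymbol{\theta}^{(0)})$$

for some positive scalar $s$. A suitable choice for $s$ could be found by an adaptation of the grid search method. For example, we might calculate the value of $\mathcal{L}\{\boldsymbol{\theta}^{(0)} + s \cdot \mathbf{g}(\boldsymbol{\theta}^{(0)})\}$ for $s = \tfrac{1}{16}, \tfrac{1}{8}, \tfrac{1}{4}, \tfrac{1}{2}, 1, 2, 4, 8,$ and $16$ and choose as the new estimate $\boldsymbol{\theta}^{(1)}$ the value of $\boldsymbol{\theta}^{(0)} + s \cdot \mathbf{g}(\boldsymbol{\theta}^{(0)})$ for which $\mathcal{L}(\boldsymbol{\theta})$ is largest. Smaller or larger values of $s$ could also be explored if the maximum appears to be at one of the extremes. If none of the values of $s$ improves the likelihood, then a very small value for $s$ such as the value $\Delta = 10^{-6}$ used to approximate the derivative should be tried.

We can then repeat the process, taking $\boldsymbol{\theta}^{(1)} = \boldsymbol{\theta}^{(0)} + s \cdot \mathbf{g}(\boldsymbol{\theta}^{(0)})$ as the starting point, evaluating the gradient at the new location $\mathbf{g}(\boldsymbol{\theta}^{(1)})$, and generating a new estimate $\boldsymbol{\theta}^{(2)}$ according to

$$\boldsymbol{\theta}^{(2)} = \boldsymbol{\theta}^{(1)} + s \cdot \mathbf{g}(\boldsymbol{\theta}^{(1)})$$

for the best choice of $s$. The process is iterated, calculating

$$\boldsymbol{\theta}^{(m+1)} = \boldsymbol{\theta}^{(m)} + s \cdot \mathbf{g}(\boldsymbol{\theta}^{(m)})$$

for $m = 0, 1, 2, \ldots$ until some convergence criterion is satisfied, such as that the gradient vector $\mathbf{g}(\boldsymbol{\theta}^{(m)})$ is within some specified tolerance of zero, the distance between $\boldsymbol{\theta}^{(m+1)}$ and $\boldsymbol{\theta}^{(m)}$ is less than some specified threshold, or the change between $\mathcal{L}(\boldsymbol{\theta}^{(m+1)})$ and $\mathcal{L}(\boldsymbol{\theta}^{(m)})$ is smaller than some desired amount.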
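
The iteration just described can be sketched as follows. The function name `steepest_ascent`, the tolerance, and the iteration cap are assumptions of this illustration; the step grid and the fallback to a very small step follow the suggestions in the text, and the gradient-norm test is only one of the convergence criteria mentioned.

```python
import numpy as np

STEP_GRID = [1/16, 1/8, 1/4, 1/2, 1, 2, 4, 8, 16]

def steepest_ascent(loglik, grad, theta0, tol=1e-8, max_iter=500):
    """Iterate theta <- theta + s * g(theta), picking s from STEP_GRID by grid search."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        g = grad(theta)
        if np.linalg.norm(g) < tol:            # gradient within tolerance of zero
            break
        best = max((theta + s * g for s in STEP_GRID), key=loglik)
        if loglik(best) <= loglik(theta):      # no grid value helped: try a tiny step
            best = theta + 1e-6 * g
            if loglik(best) <= loglik(theta):
                break
        theta = best
    return theta

def quad_loglik(theta):
    """Example log likelihood [5.7.6]."""
    return -1.5 * theta[0] ** 2 - 2.0 * theta[1] ** 2

def quad_grad(theta):
    """Analytical gradient [5.7.7]."""
    return np.array([-3.0 * theta[0], -4.0 * theta[1]])

theta_hat = steepest_ascent(quad_loglik, quad_grad, [-1.0, 1.0])
```

Starting from $\boldsymbol{\theta}^{(0)} = (-1, 1)'$, the sequence approaches the known maximum at $(0, 0)'$; the analytical gradient could be swapped for a numerical one when derivatives are hard to obtain.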

Figure 5.3 illustrates the method of steepest ascent when $\boldsymbol{\theta}$ contains $a = 2$ elements. The figure displays contour lines for the log likelihood $\mathcal{L}(\boldsymbol{\theta})$; along a given contour, the log likelihood $\mathcal{L}(\boldsymbol{\theta})$ is constant. If the iteration is started at the initial guess $\boldsymbol{\theta}^{(0)}$, the gradient $\mathbf{g}(\boldsymbol{\theta}^{(0)})$ describes the direction of steepest ascent. Finding the optimal step in that direction produces the new estimate $\boldsymbol{\theta}^{(1)}$. The gradient at that point $\mathbf{g}(\boldsymbol{\theta}^{(1)})$ then determines a new search direction on which a new estimate $\boldsymbol{\theta}^{(2)}$ is based, until the top of the hill is reached.

Figure 5.3 also illustrates a multivariate generalization of the problem with multiple local maxima seen earlier in Figure 5.2. The procedure should converge to a local maximum, which in this case is different from the global maximum $\boldsymbol{\theta}^*$. In Figure 5.3, it appears that if $\boldsymbol{\theta}^{(0)*}$ were used to begin the iteration in place of $\boldsymbol{\theta}^{(0)}$, the procedure would converge to the true global maximum $\boldsymbol{\theta}^*$. In practice, the only way to ensure that a global maximum is found is to begin the iteration from a number of different starting values for $\boldsymbol{\theta}^{(0)}$ and to continue the sequence from each starting value until the top of the hill associated with that starting value is discovered.


FIGURE 5.3 Likelihood contours and maximization by steepest ascent.


Newton-Raphson 


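One drawback to the steepest-ascent method is that it may require a very large number of iterations to close in on the local maximum. An alternative method known as Newton-Raphson often converges more quickly provided that (1) second derivatives of the log likelihood function $\mathcal{L}(\boldsymbol{\theta})$ exist and (2) the function $\mathcal{L}(\boldsymbol{\theta})$ is concave, meaning that $-1$ times the matrix of second derivatives is everywhere positive definite.

Suppose that $\boldsymbol{\theta}$ is an $(a \times 1)$ vector of parameters to be estimated. Let $\mathbf{g}(\boldsymbol{\theta}^{(0)})$ denote the gradient vector of the log likelihood function at $\boldsymbol{\theta}^{(0)}$:

$$\mathbf{g}(\boldsymbol{\theta}^{(0)}) = \left.\frac{\partial\mathcal{L}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right|_{\boldsymbol{\theta}=\boldsymbol{\theta}^{(0)}},$$

and let $\mathbf{H}(\boldsymbol{\theta}^{(0)})$ denote $-1$ times the matrix of second derivatives of the log likelihood function:

$$\mathbf{H}(\boldsymbol{\theta}^{(0)}) = -\left.\frac{\partial^2\mathcal{L}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}'}\right|_{\boldsymbol{\theta}=\boldsymbol{\theta}^{(0)}}.$$

Consider approximating $\mathcal{L}(\boldsymbol{\theta})$ with a second-order Taylor series around $\boldsymbol{\theta}^{(0)}$:

$$\mathcal{L}(\boldsymbol{\theta}) \cong \mathcal{L}(\boldsymbol{\theta}^{(0)}) + [\mathbf{g}(\boldsymbol{\theta}^{(0)})]'[\boldsymbol{\theta} - \boldsymbol{\theta}^{(0)}] - \tfrac{1}{2}[\boldsymbol{\theta} - \boldsymbol{\theta}^{(0)}]'\mathbf{H}(\boldsymbol{\theta}^{(0)})[\boldsymbol{\theta} - \boldsymbol{\theta}^{(0)}]. \tag*{[5.7.10]}$$

The idea behind the Newton-Raphson method is to choose $\boldsymbol{\theta}$ so as to maximize [5.7.10]. Setting the derivative of [5.7.10] with respect to $\boldsymbol{\theta}$ equal to zero results in

$$\mathbf{g}(\boldsymbol{\theta}^{(0)}) - \mathbf{H}(\boldsymbol{\theta}^{(0)})[\boldsymbol{\theta} - \boldsymbol{\theta}^{(0)}] = \mathbf{0}. \tag*{[5.7.11]}$$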




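Let $\boldsymbol{\theta}^{(0)}$ denote an initial guess as to the value of $\boldsymbol{\theta}$. One can calculate the derivative of the log likelihood at that initial guess ($\mathbf{g}(\boldsymbol{\theta}^{(0)})$) either analytically, as in [5.7.7], or numerically, as in [5.7.9]. One can also use analytical or numerical methods to calculate the negative of the matrix of second derivatives at the initial guess ($\mathbf{H}(\boldsymbol{\theta}^{(0)})$). Expression [5.7.11] suggests that an improved estimate of $\boldsymbol{\theta}$ (denoted $\boldsymbol{\theta}^{(1)}$) would satisfy

$$\mathbf{g}(\boldsymbol{\theta}^{(0)}) = \mathbf{H}(\boldsymbol{\theta}^{(0)})[\boldsymbol{\theta}^{(1)} - \boldsymbol{\theta}^{(0)}]$$

or

$$\boldsymbol{\theta}^{(1)} - \boldsymbol{\theta}^{(0)} = [\mathbf{H}(\boldsymbol{\theta}^{(0)})]^{-1}\mathbf{g}(\boldsymbol{\theta}^{(0)}). \tag*{[5.7.12]}$$

One could next calculate the gradient and Hessian at $\boldsymbol{\theta}^{(1)}$ and use these to find a new estimate $\boldsymbol{\theta}^{(2)}$ and continue iterating in this fashion. The $m$th step in the iteration updates the estimate of $\boldsymbol{\theta}$ by using the formula

$$\boldsymbol{\theta}^{(m+1)} = \boldsymbol{\theta}^{(m)} + [\mathbf{H}(\boldsymbol{\theta}^{(m)})]^{-1}\mathbf{g}(\boldsymbol{\theta}^{(m)}). \tag*{[5.7.13]}$$

If the log likelihood function happens to be a perfect quadratic function, then [5.7.10] holds exactly and [5.7.12] will generate the exact MLE in a single step:

$$\boldsymbol{\theta}^{(1)} = \hat{\boldsymbol{\theta}}_{MLE}.$$

If the quadratic approximation is reasonably good, Newton-Raphson should converge to the local maximum more quickly than the steepest-ascent method. However, if the likelihood function is not concave, Newton-Raphson behaves quite poorly. Thus, steepest ascent is often slower to converge but sometimes proves to be more robust compared with Newton-Raphson.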
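
A bare-bones version of the iteration [5.7.13] is sketched below. The function name `newton_raphson` and its stopping rule are assumptions of this illustration; the gradient and the matrix $\mathbf{H}$ supplied in the example are those of the quadratic log likelihood [5.7.6], for which a single step should land on the MLE.

```python
import numpy as np

def newton_raphson(grad, neg_hessian, theta0, tol=1e-10, max_iter=50):
    """Iterate theta <- theta + H(theta)^{-1} g(theta), as in [5.7.13].
    neg_hessian returns H, i.e. -1 times the matrix of second derivatives."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        g = grad(theta)
        if np.linalg.norm(g) < tol:
            break
        # solve H * step = g rather than forming the inverse explicitly
        theta = theta + np.linalg.solve(neg_hessian(theta), g)
    return theta

def quad_grad(theta):
    """Gradient [5.7.7] of the example log likelihood [5.7.6]."""
    return np.array([-3.0 * theta[0], -4.0 * theta[1]])

def quad_neg_hessian(theta):
    """H for [5.7.6]: -1 times the (constant) matrix of second derivatives."""
    return np.diag([3.0, 4.0])

theta_one_step = newton_raphson(quad_grad, quad_neg_hessian, [-1.0, 1.0], max_iter=1)
```

Here `theta_one_step` equals $(0, 0)'$ after one update, illustrating the single-step property for an exactly quadratic log likelihood.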

Since [5.7.10] is usually only an approximation to the true log likelihood function, the iteration on [5.7.13] is often modified as follows. Expression [5.7.13] is taken to suggest the search direction. The value of the log likelihood function at several points in that direction is then calculated, and the best value determines the length of the step. This strategy calls for replacing [5.7.13] by

$$\boldsymbol{\theta}^{(m+1)} = \boldsymbol{\theta}^{(m)} + s[\mathbf{H}(\boldsymbol{\theta}^{(m)})]^{-1}\mathbf{g}(\boldsymbol{\theta}^{(m)}), \tag*{[5.7.14]}$$

where $s$ is a scalar controlling the step length. One calculates $\boldsymbol{\theta}^{(m+1)}$ and the associated value for the log likelihood $\mathcal{L}(\boldsymbol{\theta}^{(m+1)})$ for various values of $s$ in [5.7.14] and chooses as the estimate $\boldsymbol{\theta}^{(m+1)}$ the value that produces the biggest value for the log likelihood.


Davidon-Fletcher-Powell 


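If $\boldsymbol{\theta}$ contains $a$ unknown parameters, then the symmetric matrix $\mathbf{H}(\boldsymbol{\theta})$ has $a(a + 1)/2$ separate elements. Calculating all these elements can be extremely time-consuming if $a$ is large. An alternative approach reasons as follows. The matrix of second derivatives ($-\mathbf{H}(\boldsymbol{\theta})$) corresponds to the first derivatives of the gradient vector ($\mathbf{g}(\boldsymbol{\theta})$), which tell us how $\mathbf{g}(\boldsymbol{\theta})$ changes as $\boldsymbol{\theta}$ changes. We get some independent information about this by comparing $\mathbf{g}(\boldsymbol{\theta}^{(m+1)}) - \mathbf{g}(\boldsymbol{\theta}^{(m)})$ with $\boldsymbol{\theta}^{(m+1)} - \boldsymbol{\theta}^{(m)}$. This is not enough information by itself to estimate $\mathbf{H}(\boldsymbol{\theta})$, but it is information that could be used to update an initial guess about the value of $\mathbf{H}(\boldsymbol{\theta})$. Thus, rather than evaluate $\mathbf{H}(\boldsymbol{\theta})$ directly at each iteration, the idea will be to start with an initial guess about $\mathbf{H}(\boldsymbol{\theta})$ and update the guess solely on the basis of how much $\mathbf{g}(\boldsymbol{\theta})$ changes between iterations, given the magnitude of the change in $\boldsymbol{\theta}$. Such methods are sometimes described as modified Newton-Raphson.

One of the most popular modified Newton-Raphson methods was proposed by Davidon (1959) and Fletcher and Powell (1963). Since it is $\mathbf{H}^{-1}$ rather than $\mathbf{H}$ itself that appears in the updating formula [5.7.14], the Davidon-Fletcher-Powell algorithm updates an estimate of $\mathbf{H}^{-1}$ at each step on the basis of the size of the change in $\mathbf{g}(\boldsymbol{\theta})$ relative to the change in $\boldsymbol{\theta}$. Specifically, let $\boldsymbol{\theta}^{(m)}$ denote an estimate of $\boldsymbol{\theta}$ that has been calculated at the $m$th iteration, and let $\mathbf{A}^{(m)}$ denote an estimate of $[\mathbf{H}(\boldsymbol{\theta}^{(m)})]^{-1}$. The new estimate $\boldsymbol{\theta}^{(m+1)}$ is given by

$$\boldsymbol{\theta}^{(m+1)} = \boldsymbol{\theta}^{(m)} + s\mathbf{A}^{(m)}\mathbf{g}(\boldsymbol{\theta}^{(m)}) \tag*{[5.7.15]}$$

for $s$ the positive scalar that maximizes $\mathcal{L}\{\boldsymbol{\theta}^{(m)} + s\mathbf{A}^{(m)}\mathbf{g}(\boldsymbol{\theta}^{(m)})\}$. Once $\boldsymbol{\theta}^{(m+1)}$ and the gradient at $\boldsymbol{\theta}^{(m+1)}$ have been calculated, a new estimate $\mathbf{A}^{(m+1)}$ is found from

$$\mathbf{A}^{(m+1)} = \mathbf{A}^{(m)} - \frac{\mathbf{A}^{(m)}(\Delta\mathbf{g}^{(m+1)})(\Delta\mathbf{g}^{(m+1)})'\mathbf{A}^{(m)}}{(\Delta\mathbf{g}^{(m+1)})'\mathbf{A}^{(m)}(\Delta\mathbf{g}^{(m+1)})} - \frac{(\Delta\boldsymbol{\theta}^{(m+1)})(\Delta\boldsymbol{\theta}^{(m+1)})'}{(\Delta\mathbf{g}^{(m+1)})'(\Delta\boldsymbol{\theta}^{(m+1)})}, \tag*{[5.7.16]}$$

where

$$\Delta\boldsymbol{\theta}^{(m+1)} \equiv \boldsymbol{\theta}^{(m+1)} - \boldsymbol{\theta}^{(m)}$$

$$\Delta\mathbf{g}^{(m+1)} \equiv \mathbf{g}(\boldsymbol{\theta}^{(m+1)}) - \mathbf{g}(\boldsymbol{\theta}^{(m)}).$$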
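In what sense should $\mathbf{A}^{(m+1)}$ as calculated from [5.7.16] be regarded as an estimate of the inverse of $\mathbf{H}(\boldsymbol{\theta}^{(m+1)})$? Consider first the case when $\theta$ is a scalar ($a = 1$). Then [5.7.16] simplifies to

$$A^{(m+1)} = A^{(m)} - \frac{(A^{(m)})^2(\Delta g^{(m+1)})^2}{(\Delta g^{(m+1)})^2 A^{(m)}} - \frac{(\Delta\theta^{(m+1)})^2}{(\Delta g^{(m+1)})(\Delta\theta^{(m+1)})} = A^{(m)} - A^{(m)} - \frac{\Delta\theta^{(m+1)}}{\Delta g^{(m+1)}} = -\frac{\Delta\theta^{(m+1)}}{\Delta g^{(m+1)}}.$$

In this case,

$$[A^{(m+1)}]^{-1} = -\frac{\Delta g^{(m+1)}}{\Delta\theta^{(m+1)}},$$

which is the natural discrete approximation to

$$H(\theta^{(m+1)}) = -\left.\frac{\partial^2\mathcal{L}}{\partial\theta^2}\right|_{\theta=\theta^{(m+1)}} = -\left.\frac{\partial g}{\partial\theta}\right|_{\theta=\theta^{(m+1)}}.$$

More generally (for $a > 1$), an estimate of the derivative of $\mathbf{g}(\cdot)$ should be related to the observed change in $\mathbf{g}(\cdot)$ according to

$$\mathbf{g}(\boldsymbol{\theta}^{(m+1)}) \cong \mathbf{g}(\boldsymbol{\theta}^{(m)}) + \left.\frac{\partial\mathbf{g}}{\partial\boldsymbol{\theta}'}\right|_{\boldsymbol{\theta}=\boldsymbol{\theta}^{(m+1)}}[\boldsymbol{\theta}^{(m+1)} - \boldsymbol{\theta}^{(m)}].$$

That is,

$$\mathbf{g}(\boldsymbol{\theta}^{(m+1)}) \cong \mathbf{g}(\boldsymbol{\theta}^{(m)}) - \mathbf{H}(\boldsymbol{\theta}^{(m+1)})[\boldsymbol{\theta}^{(m+1)} - \boldsymbol{\theta}^{(m)}]$$

or

$$\Delta\boldsymbol{\theta}^{(m+1)} \cong -[\mathbf{H}(\boldsymbol{\theta}^{(m+1)})]^{-1}\Delta\mathbf{g}^{(m+1)}.$$

Hence an estimate $\mathbf{A}^{(m+1)}$ of $[\mathbf{H}(\boldsymbol{\theta}^{(m+1)})]^{-1}$ should satisfy

$$\mathbf{A}^{(m+1)}\Delta\mathbf{g}^{(m+1)} = -\Delta\boldsymbol{\theta}^{(m+1)}. \tag*{[5.7.17]}$$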




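Postmultiplication of [5.7.16] by $\Delta\mathbf{g}^{(m+1)}$ confirms that [5.7.17] is indeed satisfied by the Davidon-Fletcher-Powell estimate $\mathbf{A}^{(m+1)}$:

$$\begin{aligned} \mathbf{A}^{(m+1)}\Delta\mathbf{g}^{(m+1)} &= \mathbf{A}^{(m)}\Delta\mathbf{g}^{(m+1)} - \frac{\mathbf{A}^{(m)}(\Delta\mathbf{g}^{(m+1)})(\Delta\mathbf{g}^{(m+1)})'\mathbf{A}^{(m)}(\Delta\mathbf{g}^{(m+1)})}{(\Delta\mathbf{g}^{(m+1)})'\mathbf{A}^{(m)}(\Delta\mathbf{g}^{(m+1)})} - \frac{(\Delta\boldsymbol{\theta}^{(m+1)})(\Delta\boldsymbol{\theta}^{(m+1)})'(\Delta\mathbf{g}^{(m+1)})}{(\Delta\mathbf{g}^{(m+1)})'(\Delta\boldsymbol{\theta}^{(m+1)})} \\ &= \mathbf{A}^{(m)}\Delta\mathbf{g}^{(m+1)} - \mathbf{A}^{(m)}\Delta\mathbf{g}^{(m+1)} - \Delta\boldsymbol{\theta}^{(m+1)} \\ &= -\Delta\boldsymbol{\theta}^{(m+1)}. \end{aligned}$$

Thus, calculation of [5.7.16] produces an estimate of $[\mathbf{H}(\boldsymbol{\theta}^{(m+1)})]^{-1}$ that is consistent with the magnitude of the observed change between $\mathbf{g}(\boldsymbol{\theta}^{(m+1)})$ and $\mathbf{g}(\boldsymbol{\theta}^{(m)})$ given the size of the change between $\boldsymbol{\theta}^{(m+1)}$ and $\boldsymbol{\theta}^{(m)}$.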


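The following proposition (proved in Appendix 5.A at the end of the chapter) establishes some further useful properties of the updating formula [5.7.16].

Proposition 5.1: (Fletcher and Powell (1963)). Consider $\mathcal{L}(\boldsymbol{\theta})$, where $\mathcal{L}: \mathbb{R}^a \to \mathbb{R}^1$ has continuous first derivatives denoted

$$\mathbf{g}(\boldsymbol{\theta}^{(m)}) = \left.\frac{\partial\mathcal{L}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right|_{\boldsymbol{\theta}=\boldsymbol{\theta}^{(m)}}.$$

Suppose that some element of $\mathbf{g}(\boldsymbol{\theta}^{(m)})$ is nonzero, and let $\mathbf{A}^{(m)}$ be a positive definite symmetric $(a \times a)$ matrix. Then the following hold.

(a) There exists a scalar $s > 0$ such that $\mathcal{L}(\boldsymbol{\theta}^{(m+1)}) > \mathcal{L}(\boldsymbol{\theta}^{(m)})$ for

$$\boldsymbol{\theta}^{(m+1)} = \boldsymbol{\theta}^{(m)} + s\mathbf{A}^{(m)}\mathbf{g}(\boldsymbol{\theta}^{(m)}). \tag*{[5.7.18]}$$

(b) If $s$ in [5.7.18] is chosen so as to maximize $\mathcal{L}(\boldsymbol{\theta}^{(m+1)})$, then the first-order conditions for an interior maximum imply that

$$[\mathbf{g}(\boldsymbol{\theta}^{(m+1)})]'[\boldsymbol{\theta}^{(m+1)} - \boldsymbol{\theta}^{(m)}] = 0. \tag*{[5.7.19]}$$

(c) Provided that [5.7.19] holds and that some element of $\mathbf{g}(\boldsymbol{\theta}^{(m+1)}) - \mathbf{g}(\boldsymbol{\theta}^{(m)})$ is nonzero, then $\mathbf{A}^{(m+1)}$ described by [5.7.16] is a positive definite symmetric matrix.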

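Result (a) establishes that as long as we are not already at an optimum ($\mathbf{g}(\boldsymbol{\theta}^{(m)}) \neq \mathbf{0}$), there exists a step in the direction suggested by the algorithm that will increase the likelihood further, provided that $\mathbf{A}^{(m)}$ is a positive definite matrix. Result (c) establishes that provided that the iteration is begun with $\mathbf{A}^{(0)}$ a positive definite matrix, then the sequence of matrices $\{\mathbf{A}^{(m)}\}_{m=1}^{N}$ should all be positive definite, meaning that each step of the iteration should increase the likelihood function. A standard procedure is to start the iteration with $\mathbf{A}^{(0)} = \mathbf{I}_a$, the $(a \times a)$ identity matrix.

If the function $\mathcal{L}(\boldsymbol{\theta})$ is exactly quadratic, so that

$$\mathcal{L}(\boldsymbol{\theta}) = \mathcal{L}(\boldsymbol{\theta}^{(0)}) + \mathbf{g}'[\boldsymbol{\theta} - \boldsymbol{\theta}^{(0)}] - \tfrac{1}{2}[\boldsymbol{\theta} - \boldsymbol{\theta}^{(0)}]'\mathbf{H}[\boldsymbol{\theta} - \boldsymbol{\theta}^{(0)}],$$

with $\mathbf{H}$ positive definite, then Fletcher and Powell (1963) showed that iteration on [5.7.15] and [5.7.16] will converge to the true global maximum in $a$ steps:

$$\boldsymbol{\theta}^{(a)} = \hat{\boldsymbol{\theta}}_{MLE} = \boldsymbol{\theta}^{(0)} + \mathbf{H}^{-1}\mathbf{g};$$

and the weighting matrix will converge to the inverse of $-1$ times the matrix of second derivatives:

$$\mathbf{A}^{(a)} = \mathbf{H}^{-1}.$$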
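
The quadratic case can be checked numerically with the short Python sketch below, which applies [5.7.15] and [5.7.16] to the example log likelihood [5.7.6] with $a = 2$. The names `dfp_update` and `dfp_quadratic` are assumptions of this illustration, and the closed-form step length is a shortcut available only because the function is exactly quadratic; in a black-box setting $s$ would instead be chosen by a search along the direction.

```python
import numpy as np

H = np.diag([3.0, 4.0])          # -1 times the matrix of second derivatives of [5.7.6]

def grad(theta):
    """Gradient of the quadratic log likelihood [5.7.6]: g(theta) = -H theta."""
    return -H @ theta

def dfp_update(A, d_theta, d_g):
    """Davidon-Fletcher-Powell updating formula [5.7.16] for the estimate of H^{-1}."""
    Ag = A @ d_g
    return (A - np.outer(Ag, Ag) / (d_g @ Ag)
              - np.outer(d_theta, d_theta) / (d_g @ d_theta))

def dfp_quadratic(theta0, n_steps):
    """Run n_steps of [5.7.15]-[5.7.16] starting from A^(0) = I_a."""
    theta = np.asarray(theta0, dtype=float)
    A = np.eye(theta.size)
    for _ in range(n_steps):
        g = grad(theta)
        direction = A @ g
        # exact maximizer of L(theta + s * direction) for a quadratic function
        s = (g @ direction) / (direction @ H @ direction)
        theta_new = theta + s * direction
        A = dfp_update(A, theta_new - theta, grad(theta_new) - g)
        theta = theta_new
    return theta, A

theta_2, A_2 = dfp_quadratic([-1.0, 1.0], n_steps=2)
```

After $a = 2$ steps, `theta_2` is numerically $(0, 0)'$ and `A_2` is numerically $\mathbf{H}^{-1}$, the diagonal matrix with entries $1/3$ and $1/4$.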


More generally, if $\mathcal{L}(\boldsymbol{\theta})$ is well approximated by a quadratic function, then the Davidon-Fletcher-Powell search procedure should approach the global maximum more quickly than the steepest-ascent method,

$$\boldsymbol{\theta}^{(N)} \cong \hat{\boldsymbol{\theta}}_{MLE}$$

for large $N$, while $\mathbf{A}^{(N)}$ should converge to the inverse of the negative of the matrix of second derivatives of the log likelihood function:

$$\mathbf{A}^{(N)} \cong -\left[\left.\frac{\partial^2\mathcal{L}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}'}\right|_{\boldsymbol{\theta}=\hat{\boldsymbol{\theta}}_{MLE}}\right]^{-1}. \tag*{[5.7.20]}$$

In practice, however, the approximation in [5.7.20] can be somewhat poor, and it is better to evaluate the matrix of second derivatives numerically for purposes of calculating standard errors, as discussed in Section 5.8.

If the function $\mathcal{L}(\boldsymbol{\theta})$ is not globally concave or if the starting value $\boldsymbol{\theta}^{(0)}$ is far from the true maximum, the Davidon-Fletcher-Powell procedure can do very badly. If problems are encountered, it often helps to try a different starting value $\boldsymbol{\theta}^{(0)}$, to rescale the data or parameters so that the elements of $\boldsymbol{\theta}$ are in comparable units, or to rescale the initial matrix $\mathbf{A}^{(0)}$, for example, by setting

$$\mathbf{A}^{(0)} = (1 \times 10^{-4})\mathbf{I}_a.$$


Other Numerical Optimization Methods 


A variety of other modified Newton-Raphson methods are available, which use alternative techniques for updating an estimate of $\mathbf{H}(\boldsymbol{\theta}^{(m)})$ or its inverse. Two of the more popular methods are those of Broyden (1965, 1967) and Berndt, Hall, Hall, and Hausman (1974). Surveys of these and a variety of other approaches are provided by Judge, Griffiths, Hill, and Lee (1980, pp. 719-72) and Quandt (1983).

Obviously, these same methods can be used to minimize a function $Q(\boldsymbol{\theta})$ with respect to $\boldsymbol{\theta}$. We simply multiply the objective function by $-1$ and then maximize the function $-Q(\boldsymbol{\theta})$.


5.8. Statistical Inference with Maximum Likelihood 
Estimation 




The previous section discussed ways to find the maximum likelihood estimate $\hat{\boldsymbol{\theta}}$ given only the numerical ability to evaluate the log likelihood function $\mathcal{L}(\boldsymbol{\theta})$. This section summarizes general approaches that can be used to test a hypothesis about $\boldsymbol{\theta}$. The section merely summarizes a number of useful results without providing any proofs. We will return to these issues in more depth in Chapter 14, where the statistical foundation behind many of these claims will be developed.

Before detailing these results, however, it is worth calling attention to two of the key assumptions behind the formulas presented in this section. First, it is assumed that the observed data are strictly stationary. Second, it is assumed that neither the estimate $\hat{\boldsymbol{\theta}}$ nor the true value $\boldsymbol{\theta}_0$ falls on a boundary of the allowable parameter space. For example, suppose that the first element of $\boldsymbol{\theta}$ is a parameter corresponding to the probability of a particular event, which must be between 0 and 1. If the event did not occur in the sample, the maximum likelihood estimate of the probability might be zero. This is an example where the estimate $\hat{\boldsymbol{\theta}}$ falls on the boundary of the allowable parameter space, in which case the formulas presented in this section will not be valid.




Asymptotic Standard Errors for Maximum Likelihood 
Estimates 


If the sample size $T$ is sufficiently large, it often turns out that the distribution of the maximum likelihood estimate $\hat{\boldsymbol{\theta}}$ can be well approximated by the following distribution:

$$\hat{\boldsymbol{\theta}} \approx N(\boldsymbol{\theta}_0, T^{-1}\mathcal{I}^{-1}), \tag*{[5.8.1]}$$

where $\boldsymbol{\theta}_0$ denotes the true parameter vector. The matrix $\mathcal{I}$ is known as the information matrix and can be estimated in either of two ways.
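The second-derivative estimate of the information matrix is

$$\hat{\mathcal{I}}_{2D} = -T^{-1}\left.\frac{\partial^2\mathcal{L}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}'}\right|_{\boldsymbol{\theta}=\hat{\boldsymbol{\theta}}}. \tag*{[5.8.2]}$$

Here $\mathcal{L}(\boldsymbol{\theta})$ denotes the log likelihood:

$$\mathcal{L}(\boldsymbol{\theta}) = \sum_{t=1}^{T}\log f_{Y_t|\mathcal{Y}_{t-1}}(y_t|\mathcal{Y}_{t-1}; \boldsymbol{\theta});$$

and $\mathcal{Y}_t$ denotes the history of observations on $y$ obtained through date $t$. The matrix of second derivatives of the log likelihood is often calculated numerically. Substituting [5.8.2] into [5.8.1], the terms involving the sample size $T$ cancel out so that the variance-covariance matrix of $\hat{\boldsymbol{\theta}}$ can be approximated by

$$E(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0)(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0)' \cong \left[-\left.\frac{\partial^2\mathcal{L}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}'}\right|_{\boldsymbol{\theta}=\hat{\boldsymbol{\theta}}}\right]^{-1}. \tag*{[5.8.3]}$$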

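A second estimate of the information matrix in [5.8.1] is called the outer-product estimate:

$$\hat{\mathcal{I}}_{OP} = T^{-1}\sum_{t=1}^{T}[\mathbf{h}(\hat{\boldsymbol{\theta}}, \mathcal{Y}_t)] \cdot [\mathbf{h}(\hat{\boldsymbol{\theta}}, \mathcal{Y}_t)]'. \tag*{[5.8.4]}$$

Here $\mathbf{h}(\hat{\boldsymbol{\theta}}, \mathcal{Y}_t)$ denotes the $(a \times 1)$ vector of derivatives of the log of the conditional density of the $t$th observation with respect to the $a$ elements of the parameter vector $\boldsymbol{\theta}$, with this derivative evaluated at the maximum likelihood estimate $\hat{\boldsymbol{\theta}}$:

$$\mathbf{h}(\hat{\boldsymbol{\theta}}, \mathcal{Y}_t) = \left.\frac{\partial\log f(y_t|y_{t-1}, y_{t-2}, \ldots; \boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right|_{\boldsymbol{\theta}=\hat{\boldsymbol{\theta}}}.$$

In this case, the variance-covariance matrix of $\hat{\boldsymbol{\theta}}$ is approximated by

$$E(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0)(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0)' \cong \left[\sum_{t=1}^{T}[\mathbf{h}(\hat{\boldsymbol{\theta}}, \mathcal{Y}_t)] \cdot [\mathbf{h}(\hat{\boldsymbol{\theta}}, \mathcal{Y}_t)]'\right]^{-1}.$$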


As an illustration of how such approximations can be used, suppose that the log likelihood is given by expression [5.7.6]. For this case, one can see analytically that

$$\frac{\partial^2\mathcal{L}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}'} = \begin{bmatrix} -3 & 0 \\ 0 & -4 \end{bmatrix},$$

and so result [5.8.3] suggests that the variance of the maximum likelihood estimate $\hat{\theta}_2$ can be approximated by $\tfrac{1}{4}$. The MLE for this example was $\hat{\theta}_2 = 0$. Thus an approximate 95% confidence interval for $\theta_2$ is given by

$$0 \pm 2\sqrt{\tfrac{1}{4}} = \pm 1.$$
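
The same calculation can be carried out numerically, as in the following Python sketch. The function names `numerical_hessian` and `ml_standard_errors`, the central-difference scheme, and the step size are assumptions made for this illustration; the code inverts $-1$ times the numerically evaluated matrix of second derivatives, as in [5.8.3], and reports the square roots of the diagonal.

```python
import numpy as np

def numerical_hessian(f, theta, h=1e-4):
    """Central-difference approximation to the matrix of second derivatives of f."""
    theta = np.asarray(theta, dtype=float)
    a = theta.size
    hess = np.empty((a, a))
    for i in range(a):
        for j in range(a):
            ei = np.zeros(a); ei[i] = h
            ej = np.zeros(a); ej[j] = h
            hess[i, j] = (f(theta + ei + ej) - f(theta + ei - ej)
                          - f(theta - ei + ej) + f(theta - ei - ej)) / (4 * h * h)
    return hess

def ml_standard_errors(loglik, theta_hat):
    """Square roots of the diagonal of [-Hessian]^{-1}, following [5.8.3]."""
    cov = np.linalg.inv(-numerical_hessian(loglik, theta_hat))
    return np.sqrt(np.diag(cov))

def quad_loglik(theta):
    """Example log likelihood [5.7.6]."""
    return -1.5 * theta[0] ** 2 - 2.0 * theta[1] ** 2

se = ml_standard_errors(quad_loglik, [0.0, 0.0])
interval_theta2 = (0.0 - 2 * se[1], 0.0 + 2 * se[1])   # approximate 95% interval
```

For this example the second standard error is 0.5, so `interval_theta2` reproduces the interval $\pm 1$ obtained analytically.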


Note that unless the off-diagonal elements of $\hat{\mathcal{I}}$ are zero, in general one needs to calculate all the elements of the matrix $\hat{\mathcal{I}}$ and invert this full matrix in order to obtain a standard error for any given parameter.

Which estimate of the information matrix, $\hat{\mathcal{I}}_{2D}$ or $\hat{\mathcal{I}}_{OP}$, is it better to use in practice? Expression [5.8.1] is only an approximation to the true distribution of $\hat{\boldsymbol{\theta}}$, and $\hat{\mathcal{I}}_{2D}$ and $\hat{\mathcal{I}}_{OP}$ are in turn only approximations to the true value of $\mathcal{I}$. The theory that justifies these approximations does not give any clear guidance to which is better to use, and typically, researchers rely on whichever estimate of the information matrix is easiest to calculate. If the two estimates differ a great deal, this may mean that the model is misspecified. White (1982) developed a general test of model specification based on this idea. One option for constructing standard errors when the two estimates differ significantly is to use the "quasi-maximum likelihood" standard errors discussed at the end of this section.


Likelihood Ratio Test 


Another popular approach to testing hypotheses about parameters that are estimated by maximum likelihood is the likelihood ratio test. Suppose a null hypothesis implies a set of $m$ different restrictions on the value of the $(a \times 1)$ parameter vector $\theta$. First, we maximize the likelihood function ignoring these restrictions to obtain the unrestricted maximum likelihood estimate $\hat{\theta}$. Next, we find an estimate $\tilde{\theta}$ that makes the likelihood as large as possible while still satisfying all the restrictions. In practice, this is usually achieved by defining a new $[(a - m) \times 1]$ vector $\lambda$ in terms of which all of the elements of $\theta$ can be expressed when the restrictions are satisfied. For example, if the restriction is that the last $m$ elements of $\theta$ are zero, then $\lambda$ consists of the first $a - m$ elements of $\theta$. Let $\mathcal{L}(\hat{\theta})$ denote the value of the log likelihood function at the unrestricted estimate, and let $\mathcal{L}(\tilde{\theta})$ denote the value of the log likelihood function at the restricted estimate. Clearly $\mathcal{L}(\hat{\theta}) \geq \mathcal{L}(\tilde{\theta})$, and it often proves to be the case that

$$2[\mathcal{L}(\hat{\theta}) - \mathcal{L}(\tilde{\theta})] \approx \chi^2(m). \qquad [5.8.5]$$
For example, suppose that $a = 2$ and we are interested in testing the hypothesis that $\theta_2 = \theta_1 + 1$. Under this null hypothesis the vector $(\theta_1, \theta_2)'$ can be written as $(\lambda, \lambda + 1)'$, where $\lambda = \theta_1$. Suppose that the log likelihood is given by expression [5.7.6]. One can find the restricted MLE by replacing $\theta_2$ by $\theta_1 + 1$ and maximizing the resulting expression with respect to $\theta_1$:

$$\tilde{\mathcal{L}}(\theta_1) = -1.5\,\theta_1^2 - 2(\theta_1 + 1)^2.$$

The first-order condition for maximization of $\tilde{\mathcal{L}}(\theta_1)$ is

$$-3\theta_1 - 4(\theta_1 + 1) = 0,$$

or $\theta_1 = -\tfrac{4}{7}$. The restricted MLE is thus $\tilde{\theta} = (-\tfrac{4}{7}, \tfrac{3}{7})'$, and the maximum value attained for the log likelihood while satisfying the restriction is

$$\mathcal{L}(\tilde{\theta}) = -\tfrac{3}{2}\left(-\tfrac{4}{7}\right)^2 - \tfrac{4}{2}\left(\tfrac{3}{7}\right)^2 = -\frac{3\cdot 4}{2\cdot 7\cdot 7}\,(4 + 3) = -\tfrac{6}{7}.$$

The unrestricted MLE is $\hat{\theta} = 0$, at which $\mathcal{L}(\hat{\theta}) = 0$. Hence, [5.8.5] would be

$$2[\mathcal{L}(\hat{\theta}) - \mathcal{L}(\tilde{\theta})] = \tfrac{12}{7} = 1.71.$$

The test here involves a single restriction, so $m = 1$. From Table B.2 in Appendix B, the probability that a $\chi^2(1)$ variable exceeds 3.84 is 0.05. Since $1.71 < 3.84$, we accept the null hypothesis that $\theta_2 = \theta_1 + 1$ at the 5% significance level.
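The arithmetic of this example can be reproduced in a few lines. This is a sketch only: it assumes the same quadratic log likelihood used in the worked example, solves the first-order condition by hand rather than by numerical search, and hard-codes the 5% critical value 3.84 quoted from Table B.2. All names are illustrative.

```python
def log_likelihood(theta1, theta2):
    """Quadratic log likelihood used in the worked example."""
    return -1.5 * theta1 ** 2 - 2.0 * theta2 ** 2


# Unrestricted MLE.
theta_hat = (0.0, 0.0)

# Restricted MLE: impose theta_2 = theta_1 + 1 and solve -3*lam - 4*(lam + 1) = 0.
lam = -4.0 / 7.0
theta_tilde = (lam, lam + 1.0)

# Likelihood ratio statistic [5.8.5].
lr_stat = 2.0 * (log_likelihood(*theta_hat) - log_likelihood(*theta_tilde))

CRITICAL_VALUE_5PCT = 3.84  # chi-square(1) critical value quoted in the text
reject_null = lr_stat > CRITICAL_VALUE_5PCT
```

In a less tidy problem the restricted maximum would be found by running the numerical optimizer over $\lambda$ instead of solving the first-order condition analytically.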


Lagrange Multiplier Test 


In order to use the standard errors from [5.8.2] or [5.8.4] to test a hypothesis about $\theta$, we need only to find the unrestricted MLE $\hat{\theta}$. In order to use the likelihood ratio test [5.8.5], it is necessary to find both the unrestricted MLE $\hat{\theta}$ and the restricted MLE $\tilde{\theta}$. The Lagrange multiplier test provides a third principle with which to test a null hypothesis that requires only the restricted MLE $\tilde{\theta}$. This test is useful when it is easier to calculate the restricted estimate $\tilde{\theta}$ than the unrestricted estimate $\hat{\theta}$.

Let $\theta$ be an $(a \times 1)$ vector of parameters, and let $\tilde{\theta}$ be an estimate of $\theta$ that maximizes the log likelihood subject to a set of $m$ restrictions on $\theta$. Let $f(y_t \mid y_{t-1}, y_{t-2}, \ldots;\, \theta)$ be the conditional density of the $t$th observation, and let $h(\tilde{\theta}, \mathcal{Y}_t)$ denote the $(a \times 1)$ vector of derivatives of the log of this conditional density evaluated at the restricted estimate $\tilde{\theta}$:

$$h(\tilde{\theta}, \mathcal{Y}_t) = \left.\frac{\partial \log f(y_t \mid y_{t-1}, y_{t-2}, \ldots;\, \theta)}{\partial\theta}\right|_{\theta=\tilde{\theta}}.$$

The Lagrange multiplier test of the null hypothesis that the restrictions are true is given by the following statistic:

$$T^{-1}\left[\sum_{t=1}^{T} h(\tilde{\theta}, \mathcal{Y}_t)\right]' \mathcal{I}^{-1} \left[\sum_{t=1}^{T} h(\tilde{\theta}, \mathcal{Y}_t)\right]. \qquad [5.8.6]$$

If the null hypothesis is true, then for large $T$ this should approximately have a $\chi^2(m)$ distribution. The information matrix $\mathcal{I}$ can again be estimated as in [5.8.2] or [5.8.4] with $\hat{\theta}$ replaced by $\tilde{\theta}$.
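A minimal scalar illustration of [5.8.6] follows. It is a sketch under assumptions that are not in the text: the observations are treated as i.i.d. $N(\mu, 1)$, the null hypothesis is $\mu = \mu_0$ (so the restricted estimate is simply $\mu_0$), and the data are made up. Under these assumptions the score of observation $t$ is $y_t - \mu$ and the second-derivative estimate of the information matrix equals 1.

```python
def lm_statistic(y, mu_null):
    """Lagrange multiplier statistic [5.8.6] for H0: mu = mu_null in an assumed N(mu, 1) model."""
    T = len(y)
    # Score of each observation evaluated at the restricted estimate.
    scores = [obs - mu_null for obs in y]
    score_sum = sum(scores)
    # Second-derivative estimate: -T^{-1} * sum of (-1) over t = 1.
    information = 1.0
    return score_sum * (1.0 / information) * score_sum / T


y = [0.5, -0.2, 1.1, 0.3, -0.4, 0.9, 0.2, 0.6]  # hypothetical sample
lm_stat = lm_statistic(y, mu_null=0.0)
reject_null = lm_stat > 3.84  # chi-square(1) 5% critical value quoted in the text
```

Only the restricted estimate enters the calculation, which is the practical attraction of the test described above.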


Quasi-Maximum Likelihood Standard Errors 


It was mentioned earlier in this section that if the data were really generated from the assumed density and the sample size is sufficiently large, the second-derivative estimate $\hat{\mathcal{I}}_{2D}$ and the outer-product estimate $\hat{\mathcal{I}}_{OP}$ of the information matrix should be reasonably close to each other. However, maximum likelihood estimation may still be a reasonable way to estimate parameters even if the data were not generated by the assumed density. For example, we noted in Section 5.2 that the conditional MLE for a Gaussian AR(1) process is obtained from an OLS regression of $y_t$ on $y_{t-1}$. This OLS regression is often a very sensible way to estimate parameters of an AR(1) process even if the true innovations $\varepsilon_t$ are not i.i.d. Gaussian. Although maximum likelihood may be yielding a reasonable estimate of $\theta$ when the innovations are not i.i.d. Gaussian, the standard errors proposed in [5.8.2] or [5.8.4] may no longer be valid. An approximate variance-covariance matrix for $\hat{\theta}$ that is sometimes valid even if the probability density is misspecified is given by

$$E(\hat{\theta} - \theta_0)(\hat{\theta} - \theta_0)' \cong T^{-1}\{\hat{\mathcal{I}}_{2D}\,\hat{\mathcal{I}}_{OP}^{-1}\,\hat{\mathcal{I}}_{2D}\}^{-1}. \qquad [5.8.7]$$

This variance-covariance matrix was proposed by White (1982), who described this approach as quasi-maximum likelihood estimation.
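To see how the three variance estimates can diverge, consider a deliberately misspecified scalar case. The sketch below assumes a working model $N(\mu, 1)$ applied to made-up data whose spread is far larger than 1; the model, the data, and the function name are illustrative assumptions, not part of the text.

```python
def variance_estimates(y):
    """Return (second-derivative, outer-product, quasi-ML) variance estimates for the mean."""
    T = len(y)
    mu_hat = sum(y) / T                     # MLE of mu under the assumed N(mu, 1) density
    scores = [obs - mu_hat for obs in y]    # h(theta_hat, Y_t) for this density
    info_2d = 1.0                           # [5.8.2]: each second derivative equals -1
    info_op = sum(h * h for h in scores) / T  # [5.8.4]
    var_2d = 1.0 / (T * info_2d)
    var_op = 1.0 / (T * info_op)
    # [5.8.7]: T^{-1} * {I_2D * I_OP^{-1} * I_2D}^{-1}, written out for scalars.
    var_qmle = 1.0 / (T * info_2d * (1.0 / info_op) * info_2d)
    return var_2d, var_op, var_qmle


y = [2.0, -1.0, 4.0, 0.0, 5.0, -4.0, 1.0, 1.0]  # hypothetical data, sample variance 7
var_2d, var_op, var_qmle = variance_estimates(y)
```

Here the quasi-ML variance reduces to the sample variance divided by $T$, the familiar variance of a sample mean, while the other two estimates inherit the incorrect unit-variance assumption in opposite directions.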




5.9. Inequality Constraints 


A Common Pitfall with Numerical Maximization 


Suppose we were to apply one of the methods discussed in Section 5.7, such as steepest ascent, to the AR(1) likelihood [5.7.2]. We start with an arbitrary initial guess, say, $\phi^{(0)} = 0.1$. We calculate the gradient at this point and find that it is positive. The computer is then programmed to try to improve this estimate by evaluating the log likelihood at points described by $\phi^{(1)} = \phi^{(0)} + s\cdot g(\phi^{(0)})$ for various values of $s$, seeing what works best. But if the computer were to try a value for $s$ such that $\phi^{(1)} = \phi^{(0)} + s\cdot g(\phi^{(0)}) = 1.1$, calculation of [5.7.2] would involve finding the log of $(1 - 1.1^2) = -0.21$. Attempting to calculate the log of a negative number would typically be a fatal execution error, causing the search procedure to crash.

Often such problems can be avoided by using modified Newton-Raphson procedures, provided that the initial estimate $\theta^{(0)}$ is chosen wisely and provided that the initial search area is kept fairly small. The latter might be accomplished by setting the initial weighting matrix $A^{(0)}$ in [5.7.15] and [5.7.16] equal to a small multiple of the identity matrix, such as $A^{(0)} = (1 \times 10^{-4})\cdot I_a$. In later iterations, the algorithm should use the shape of the likelihood function in the vicinity of the maximum to keep the search conservative. However, if the true MLE is close to one of the boundaries (for example, if $\hat{\phi}_{MLE} = 0.998$ in the AR(1) example), it will be virtually impossible to keep a numerical algorithm from exploring what happens when $\phi$ is greater than unity, which would induce a fatal crash.


Solving the Problem by Reparameterizing the Likelihood Function


One simple way to ensure that a numerical search always stays within certain specified boundaries is to reparameterize the likelihood function in terms of an $(a \times 1)$ vector $\lambda$ for which $\theta = g(\lambda)$, where the function $g: \mathbb{R}^a \to \mathbb{R}^a$ incorporates the desired restrictions. The scheme is then as follows:

Input:      values of $y_1, y_2, \ldots, y_T$ and $\lambda$
Procedure:  set $\theta = g(\lambda)$; calculate $\mathcal{L}(\theta)$
Output:     value of $\mathcal{L}(g(\lambda))$

For example, to ensure that $\phi$ is always between $\pm 1$, we could take

$$\phi = g(\lambda) = \frac{\lambda}{1 + |\lambda|}. \qquad [5.9.1]$$


The goal is to find the value of $\lambda$ that produces the biggest value for the log likelihood. We start with an initial guess such as $\lambda = 3$. The procedure to evaluate the log likelihood function first calculates

$$\phi = 3/(1 + 3) = 0.75$$

and then finds the value for the log likelihood associated with this value of $\phi$ from [5.7.2]. No matter what value for $\lambda$ the computer guesses, the value of $\phi$ in [5.9.1] will always be less than 1 in absolute value and the likelihood function will be well defined. Once we have found the value of $\hat{\lambda}$ that maximizes the likelihood function, the maximum likelihood estimate of $\phi$ is then given by

$$\hat{\phi} = \frac{\hat{\lambda}}{1 + |\hat{\lambda}|}.$$

This technique of reparameterizing the likelihood function so that estimates always satisfy any necessary constraints is often very easy to implement. However, one note of caution should be mentioned. If a standard error is calculated from the matrix of second derivatives of the log likelihood as in [5.8.3], this represents the standard error of $\hat{\lambda}$, not the standard error of $\hat{\phi}$. To obtain a standard error for $\hat{\phi}$, the best approach is first to parameterize the likelihood function in terms of $\lambda$ to find the MLE, and then to reparameterize in terms of $\phi$ to calculate the matrix of second derivatives evaluated at $\hat{\phi}$ to get the final standard error for $\hat{\phi}$. Alternatively, one can calculate an approximation to the standard error for $\hat{\phi}$ from the standard error for $\hat{\lambda}$, based on the formula for a Wald test of a nonlinear hypothesis described in Chapter 14.
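The mapping [5.9.1] and the second route to a standard error can be sketched as follows. Two things here are assumptions rather than statements from the text: the "approximation from the standard error for $\hat{\lambda}$" is implemented as a first-order (delta-method) scaling by $|g'(\hat{\lambda})|$, and the value `se_lambda = 0.8` is a made-up number standing in for whatever the optimizer would report.

```python
import math


def g(lam):
    """Map an unrestricted lambda into (-1, 1) as in [5.9.1]."""
    return lam / (1.0 + abs(lam))


def g_prime(lam):
    """Derivative of g; equals 1 / (1 + |lambda|)^2 on both sides of zero."""
    return 1.0 / (1.0 + abs(lam)) ** 2


def log_one_minus_phi_squared(phi):
    """The term whose argument went negative in the pitfall described earlier."""
    return math.log(1.0 - phi ** 2)


# Whatever lambda the optimizer tries, phi stays strictly inside the unit interval.
lambda_guesses = [-1e6, -3.0, 0.0, 3.0, 1e6]
phis = [g(lam) for lam in lambda_guesses]
log_terms = [log_one_minus_phi_squared(phi) for phi in phis]  # no domain errors

# First-order approximation to the standard error of phi_hat.
lambda_hat = 3.0
se_lambda = 0.8  # hypothetical standard error of lambda_hat
se_phi = g_prime(lambda_hat) * se_lambda
```

One practical caveat: for astronomically large $|\lambda|$ the floating-point value of `g(lam)` rounds to exactly 1, so the guarantee holds in exact arithmetic but can still fail numerically at the extreme.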


Parameterizations for a Variance-Covariance Matrix 


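Another common restriction one needs to impose is that a variance parameter $\sigma^2$ be positive. An obvious way to achieve this is to parameterize the likelihood in terms of $\lambda$, which represents $\pm 1$ times the standard deviation. The procedure to evaluate the log likelihood then begins by squaring this parameter $\lambda$:

$$\sigma^2 = \lambda^2;$$

and if the standard deviation $\sigma$ is itself called, it is calculated as

$$\sigma = \sqrt{\lambda^2}.$$

More generally, let $\Omega$ denote an $(n \times n)$ variance-covariance matrix:

$$\Omega = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1n} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2n} \\ \vdots & \vdots & & \vdots \\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_{nn} \end{bmatrix}.$$

Here one needs to impose the condition that $\Omega$ is positive definite and symmetric. The best approach is to parameterize $\Omega$ in terms of the $n(n + 1)/2$ distinct elements of the Cholesky decomposition of $\Omega$:

$$\Omega = PP', \qquad [5.9.2]$$

where

$$P = \begin{bmatrix} \lambda_{11} & 0 & 0 & \cdots & 0 \\ \lambda_{21} & \lambda_{22} & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ \lambda_{n1} & \lambda_{n2} & \lambda_{n3} & \cdots & \lambda_{nn} \end{bmatrix}.$$

No matter what values the computer guesses for $\lambda_{11}, \lambda_{21}, \ldots, \lambda_{nn}$, the matrix $\Omega$ calculated from [5.9.2] will be symmetric and positive semidefinite.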
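A small sketch of [5.9.2] in plain Python follows. The row-by-row ordering of the free parameters and the helper names are arbitrary choices made for illustration; the parameter values are deliberately picked so that one diagonal element of $P$ is zero, which shows why the guarantee is semidefiniteness rather than strict definiteness.

```python
def omega_from_lambdas(lambdas, n):
    """Fill a lower-triangular P row by row from n(n+1)/2 free parameters; return P P'."""
    P = [[0.0] * n for _ in range(n)]
    k = 0
    for i in range(n):
        for j in range(i + 1):
            P[i][j] = lambdas[k]
            k += 1
    return [[sum(P[i][r] * P[j][r] for r in range(n)) for j in range(n)] for i in range(n)]


def quadratic_form(omega, x):
    """Compute x' Omega x."""
    n = len(x)
    return sum(x[i] * omega[i][j] * x[j] for i in range(n) for j in range(n))


# Arbitrary guesses, including a negative diagonal entry and a zero diagonal entry.
lambdas = [1.5, -0.4, 0.0, 2.0, 0.3, -1.2]
omega = omega_from_lambdas(lambdas, 3)

test_vectors = [[1.0, 0.0, 0.0], [0.4, 1.5, 0.0], [-2.0, 1.0, 3.0]]
forms = [quadratic_form(omega, x) for x in test_vectors]
```

The second test vector produces a quadratic form of zero up to rounding, because the zero on the diagonal of $P$ makes $\Omega$ singular.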




Parameterizations for Probabilities 


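Sometimes some of the unknown parameters are probabilities $p_1, p_2, \ldots, p_K$, which must satisfy the restrictions

$$0 \leq p_i \leq 1 \quad \text{for } i = 1, 2, \ldots, K$$
$$p_1 + p_2 + \cdots + p_K = 1.$$

In this case, one approach is to parameterize the probabilities in terms of $\lambda_1, \lambda_2, \ldots, \lambda_{K-1}$, where

$$p_i = \lambda_i^2/(1 + \lambda_1^2 + \lambda_2^2 + \cdots + \lambda_{K-1}^2) \quad \text{for } i = 1, 2, \ldots, K - 1$$
$$p_K = 1/(1 + \lambda_1^2 + \lambda_2^2 + \cdots + \lambda_{K-1}^2).$$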
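The mapping from unrestricted $\lambda$'s to probabilities can be written in a few lines. This is a sketch with made-up input values; the function name is illustrative.

```python
def probabilities(lambdas):
    """Map K-1 unrestricted values into K probabilities that sum to one."""
    denominator = 1.0 + sum(lam * lam for lam in lambdas)
    first = [lam * lam / denominator for lam in lambdas]  # p_1, ..., p_{K-1}
    last = 1.0 / denominator                              # p_K
    return first + [last]


p = probabilities([2.0, -1.0, 0.0])  # K = 4
```

Note that $p_K$ is strictly positive for every finite $\lambda$, whereas the other probabilities can equal zero exactly, so the parameterization is not symmetric across the $K$ categories.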


More General Inequality Constraints 


For more complicated inequality constraints that do not admit a simple re- 
parameterization, an approach that sometimes works is to put a branching statement 
in the procedure to evaluate the log likelihood function. The procedure first checks 
whether the constraint is satisfied. If it is, then the likelihood function is evaluated 
in the usual way. If it is not, then the procedure returns a large negative number 
in place of the value of the log likelihood function. Sometimes such an approach 
will allow an MLE satisfying the specified conditions to be found with simple 
numerical search procedures. 
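The branching idea can be sketched as a wrapper around the likelihood procedure. Everything specific here is hypothetical: the penalty value, the constraint $|\phi| < 1$, the toy objective, and the crude grid search that stands in for a real optimizer.

```python
import math

PENALTY = -1.0e10  # "large negative number" returned when the constraint fails


def guarded_log_likelihood(phi, log_likelihood):
    """Check the constraint first; evaluate the likelihood only if it holds."""
    if not abs(phi) < 1.0:
        return PENALTY
    return log_likelihood(phi)


def toy_log_likelihood(phi):
    """Hypothetical objective containing a log(1 - phi^2) term, undefined for |phi| >= 1."""
    return 0.5 * math.log(1.0 - phi ** 2) - 2.0 * (phi - 0.9) ** 2


# A crude search that freely proposes infeasible points without crashing.
grid = [i / 100.0 for i in range(-150, 151)]
best_phi = max(grid, key=lambda phi: guarded_log_likelihood(phi, toy_log_likelihood))
```

A flat penalty gives a gradient-based method no information about which direction leads back to the feasible region, which may be one reason the text says this only sometimes works.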

If these measures prove inadequate, more complicated algorithms are avail- 
able. Judge, Griffiths, Hill, and Lee (1980, pp. 747-49) described some of the 
possible approaches. 


APPENDIX 5.A. Proofs of Chapter 5 Propositions 
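Proof of Proposition 5.1.

(a) By Taylor's theorem,

$$\mathcal{L}(\theta^{(m+1)}) = \mathcal{L}(\theta^{(m)}) + [g(\theta^{(m)})]'[\theta^{(m+1)} - \theta^{(m)}] + R_1(\theta^{(m)}, \theta^{(m+1)}). \qquad [5.A.1]$$

Substituting [5.7.18] into [5.A.1],

$$\mathcal{L}(\theta^{(m+1)}) - \mathcal{L}(\theta^{(m)}) = [g(\theta^{(m)})]'\, s A^{(m)} g(\theta^{(m)}) + R_1(\theta^{(m)}, \theta^{(m+1)}). \qquad [5.A.2]$$

Since $A^{(m)}$ is positive definite and since $g(\theta^{(m)}) \neq 0$, expression [5.A.2] establishes that

$$\mathcal{L}(\theta^{(m+1)}) - \mathcal{L}(\theta^{(m)}) = s\,\kappa(\theta^{(m)}) + R_1(\theta^{(m)}, \theta^{(m+1)}),$$

where $\kappa(\theta^{(m)}) > 0$. Moreover, $s^{-1}\cdot R_1(\theta^{(m)}, \theta^{(m+1)}) \to 0$ as $s \to 0$. Hence, there exists an $s$ such that $\mathcal{L}(\theta^{(m+1)}) - \mathcal{L}(\theta^{(m)}) > 0$, as claimed.

(b) Direct differentiation reveals

$$\frac{\partial \mathcal{L}(\theta^{(m+1)})}{\partial s} = \frac{\partial \mathcal{L}}{\partial \theta_1}\frac{\partial \theta_1}{\partial s} + \frac{\partial \mathcal{L}}{\partial \theta_2}\frac{\partial \theta_2}{\partial s} + \cdots + \frac{\partial \mathcal{L}}{\partial \theta_a}\frac{\partial \theta_a}{\partial s}$$
$$= [g(\theta^{(m+1)})]'\,\frac{\partial \theta^{(m+1)}}{\partial s} \qquad [5.A.3]$$
$$= [g(\theta^{(m+1)})]'\, A^{(m)} g(\theta^{(m)}),$$

with the last line following from [5.7.18]. The first-order conditions set [5.A.3] equal to zero, which implies

$$0 = [g(\theta^{(m+1)})]'\, s A^{(m)} g(\theta^{(m)}) = [g(\theta^{(m+1)})]'[\theta^{(m+1)} - \theta^{(m)}],$$

with the last line again following from [5.7.18]. This establishes the claim in [5.7.19].

(c) Let $y$ be any $(a \times 1)$ nonzero vector. The task is to show that $y'A^{(m+1)}y > 0$. Observe from [5.7.16] that

$$y'A^{(m+1)}y = y'A^{(m)}y - \frac{y'A^{(m)}(\Delta g^{(m+1)})(\Delta g^{(m+1)})'A^{(m)}y}{(\Delta g^{(m+1)})'A^{(m)}(\Delta g^{(m+1)})} - \frac{y'(\Delta\theta^{(m+1)})(\Delta\theta^{(m+1)})'y}{(\Delta g^{(m+1)})'(\Delta\theta^{(m+1)})}. \qquad [5.A.4]$$

Since $A^{(m)}$ is positive definite, there exists a nonsingular matrix $P$ such that

$$A^{(m)} = PP'.$$

Define

$$y^* = P'y$$
$$x^* = P'\Delta g^{(m+1)}.$$

Then [5.A.4] can be written as

$$y'A^{(m+1)}y = y'PP'y - \frac{y'PP'(\Delta g^{(m+1)})(\Delta g^{(m+1)})'PP'y}{(\Delta g^{(m+1)})'PP'(\Delta g^{(m+1)})} - \frac{y'(\Delta\theta^{(m+1)})(\Delta\theta^{(m+1)})'y}{(\Delta g^{(m+1)})'(\Delta\theta^{(m+1)})}$$
$$= y^{*\prime}y^* - \frac{(y^{*\prime}x^*)(x^{*\prime}y^*)}{x^{*\prime}x^*} - \frac{y'(\Delta\theta^{(m+1)})(\Delta\theta^{(m+1)})'y}{(\Delta g^{(m+1)})'(\Delta\theta^{(m+1)})}. \qquad [5.A.5]$$

Recalling equation [4.A.6], the first two terms in the last line of [5.A.5] represent the sum of squared residuals from an OLS regression of $y^*$ on $x^*$. This cannot be negative,

$$y^{*\prime}y^* - \frac{(y^{*\prime}x^*)(x^{*\prime}y^*)}{x^{*\prime}x^*} \geq 0; \qquad [5.A.6]$$

it would equal zero only if the OLS regression has a perfect fit, or if $y^* = \beta x^*$ or $P'y = \beta P'\Delta g^{(m+1)}$ for some $\beta$. Since $P$ is nonsingular, expression [5.A.6] would only be zero if $y = \beta\Delta g^{(m+1)}$ for some $\beta$. Consider two cases.

Case 1. There is no $\beta$ such that $y = \beta\Delta g^{(m+1)}$. In this case, the inequality [5.A.6] is strict and [5.A.5] implies

$$y'A^{(m+1)}y > -\frac{[y'\Delta\theta^{(m+1)}]^2}{(\Delta g^{(m+1)})'(\Delta\theta^{(m+1)})}.$$

Since $[y'\Delta\theta^{(m+1)}]^2 \geq 0$, it follows that $y'A^{(m+1)}y > 0$, provided that

$$(\Delta g^{(m+1)})'(\Delta\theta^{(m+1)}) < 0. \qquad [5.A.7]$$

But, from [5.7.19],

$$(\Delta g^{(m+1)})'(\Delta\theta^{(m+1)}) = [g(\theta^{(m+1)}) - g(\theta^{(m)})]'(\Delta\theta^{(m+1)})$$
$$= -g(\theta^{(m)})'(\Delta\theta^{(m+1)}) \qquad [5.A.8]$$
$$= -g(\theta^{(m)})'\, s A^{(m)} g(\theta^{(m)}),$$

with the last line following from [5.7.18]. But the final term in [5.A.8] must be negative, by virtue of the facts that $A^{(m)}$ is positive definite, $s > 0$, and $g(\theta^{(m)}) \neq 0$. Hence, [5.A.7] holds, meaning that $A^{(m+1)}$ is positive definite for this case.

Case 2. There exists a $\beta$ such that $y = \beta\Delta g^{(m+1)}$. In this case, [5.A.6] is zero, so that [5.A.5] becomes

$$y'A^{(m+1)}y = -\frac{y'(\Delta\theta^{(m+1)})(\Delta\theta^{(m+1)})'y}{(\Delta g^{(m+1)})'(\Delta\theta^{(m+1)})}$$
$$= -\frac{\beta(\Delta g^{(m+1)})'(\Delta\theta^{(m+1)})(\Delta\theta^{(m+1)})'\beta(\Delta g^{(m+1)})}{(\Delta g^{(m+1)})'(\Delta\theta^{(m+1)})}$$
$$= -\beta^2(\Delta g^{(m+1)})'(\Delta\theta^{(m+1)}) = \beta^2 g(\theta^{(m)})'\, s A^{(m)} g(\theta^{(m)}) > 0,$$

as in [5.A.8].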




Chapter 5 Exercises 


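5.1. Show that the value of [5.4.16] at $\theta = \tilde{\theta}$, $\sigma^2 = \tilde{\sigma}^2$ is identical to its value at $\theta = \tilde{\theta}^{-1}$, $\sigma^2 = \tilde{\theta}^2\tilde{\sigma}^2$.


5.2. Verify that expression [5.7.12] calculates the maximum of [5.7.6] in a single step from the initial estimate $\theta^{(0)} = (-1, 1)'$.


5.3. Let $(y_1, y_2, \ldots, y_T)$ be a sample of size $T$ drawn from an i.i.d. $N(\mu, \sigma^2)$ distribution.

(a) Show that the maximum likelihood estimates are given by

$$\hat{\mu} = T^{-1}\sum_{t=1}^{T} y_t$$
$$\hat{\sigma}^2 = T^{-1}\sum_{t=1}^{T} (y_t - \hat{\mu})^2.$$

(b) Show that the matrix $\hat{\mathcal{I}}_{2D}$ in [5.8.2] is

$$\hat{\mathcal{I}}_{2D} = \begin{bmatrix} 1/\hat{\sigma}^2 & 0 \\ 0 & 1/(2\hat{\sigma}^4) \end{bmatrix}.$$

(c) Show that for this example result [5.8.1] suggests

$$\begin{bmatrix} \hat{\mu} \\ \hat{\sigma}^2 \end{bmatrix} \approx N\left(\begin{bmatrix} \mu \\ \sigma^2 \end{bmatrix}, \begin{bmatrix} \sigma^2/T & 0 \\ 0 & 2\sigma^4/T \end{bmatrix}\right).$$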

Chapter 5 References 


Anderson, Brian D. O., and John B. Moore. 1979. Optimal Filtering. Englewood Cliffs, N.J.: Prentice-Hall.

Berndt, E. K., B. H. Hall, R. E. Hall, and J. A. Hausman. 1974. "Estimation and Inference in Nonlinear Structural Models." Annals of Economic and Social Measurement 3:653-65.

Box, George E. P., and D. R. Cox. 1964. "An Analysis of Transformations." Journal of the Royal Statistical Society Series B, 26:211-52.

Box, George E. P., and Gwilym M. Jenkins. 1976. Time Series Analysis: Forecasting and Control, rev. ed. San Francisco: Holden-Day.

Broyden, C. G. 1965. "A Class of Methods for Solving Nonlinear Simultaneous Equations." Mathematics of Computation 19:577-93.

Broyden, C. G. 1967. "Quasi-Newton Methods and Their Application to Function Minimization." Mathematics of Computation 21:368-81.

Chiang, Alpha C. 1974. Fundamental Methods of Mathematical Economics, 2d ed. New York: McGraw-Hill.

Davidon, W. C. 1959. "Variable Metric Method of Minimization." A.E.C. Research and Development Report ANL-5990 (rev.).

Fletcher, R., and M. J. D. Powell. 1963. "A Rapidly Convergent Descent Method for Minimization." Computer Journal 6:163-68.

Galbraith, R. F., and J. I. Galbraith. 1974. "On the Inverses of Some Patterned Matrices Arising in the Theory of Stationary Time Series." Journal of Applied Probability 11:63-71.

Hannan, E., and J. Rissanen. 1982. "Recursive Estimation of Mixed Autoregressive-Moving Average Order." Biometrika 69:81-94.

Janacek, G. J., and A. L. Swift. 1990. "A Class of Models for Non-Normal Time Series." Journal of Time Series Analysis 11:19-31.

Judge, George G., William E. Griffiths, R. Carter Hill, and Tsoung-Chao Lee. 1980. The Theory and Practice of Econometrics. New York: Wiley.

Koreisha, Sergio, and Tarmo Pukkila. 1989. "Fast Linear Estimation Methods for Vector Autoregressive Moving-Average Models." Journal of Time Series Analysis 10:325-39.

Li, W. K., and A. I. McLeod. 1988. "ARMA Modelling with Non-Gaussian Innovations." Journal of Time Series Analysis 9:155-68.

Martin, R. D. 1981. "Robust Methods for Time Series," in D. F. Findley, ed., Applied Time Series, Vol. II. New York: Academic Press.

Nelson, Harold L., and C. W. J. Granger. 1979. "Experience with Using the Box-Cox Transformation When Forecasting Economic Time Series." Journal of Econometrics 10:57-69.

Quandt, Richard E. 1983. "Computational Problems and Methods," in Zvi Griliches and Michael D. Intriligator, eds., Handbook of Econometrics, Vol. 1. Amsterdam: North-Holland.

White, Halbert. 1982. "Maximum Likelihood Estimation of Misspecified Models." Econometrica 50:1-25.


Spectral Analysis 


Up to this point in the book, the value of a variable $Y_t$ at date $t$ has typically been described in terms of a sequence of innovations $\{\varepsilon_t\}_{t=-\infty}^{\infty}$ in models of the form

$$Y_t = \mu + \sum_{j=0}^{\infty} \psi_j \varepsilon_{t-j}.$$

The focus has been on the implications of such a representation for the covariance between $Y_t$ and $Y_\tau$ at distinct dates $t$ and $\tau$. This is known as analyzing the properties of $\{Y_t\}_{t=-\infty}^{\infty}$ in the time domain.

This chapter instead describes the value of $Y_t$ as a weighted sum of periodic functions of the form $\cos(\omega t)$ and $\sin(\omega t)$, where $\omega$ denotes a particular frequency:

$$Y_t = \mu + \int_0^{\pi} \alpha(\omega)\cdot\cos(\omega t)\, d\omega + \int_0^{\pi} \delta(\omega)\cdot\sin(\omega t)\, d\omega.$$


The goal will be to determine how important cycles of different frequencies are in 
accounting for the behavior of Y. This is known as frequency-domain or spectral 
analysis. As we will see, the two kinds of analysis are not mutually exclusive. Any 
covariance-stationary process has both a time-domain representation and a fre- 
quency-domain representation, and any feature of the data that can be described 
by one representation can equally well be described by the other representation. 
For some features, the time-domain description may be simpler, while for other 
features the frequency-domain description may be simpler. 

Section 6.1 describes the properties of the population spectrum and introduces 
the spectral representation theorem, which can be viewed as a frequency-domain 
version of Wold’s theorem. Section 6.2 introduces the sample analog of the pop- 
ulation spectrum and uses an OLS regression framework to motivate the spectral 
representation theorem and to explain the sense in which the spectrum identifies 
the contributions to the variance of the observed data of periodic components with 
different cycles. Section 6.3 discusses strategies for estimating the population spec- 
trum. Section 6.4 provides an example of applying spectral techniques and discusses 
some of the ways they can be used in practice. More detailed discussions of spectral 
analysis are provided by Anderson (1971), Bloomfield (1976), and Fuller (1976). 


6.1. The Population Spectrum 


The Population Spectrum and Its Properties 


Let $\{Y_t\}_{t=-\infty}^{\infty}$ be a covariance-stationary process with mean $E(Y_t) = \mu$ and $j$th autocovariance

$$E(Y_t - \mu)(Y_{t-j} - \mu) = \gamma_j.$$

Assuming that these autocovariances are absolutely summable, the autocovariance-generating function is given by

$$g_Y(z) = \sum_{j=-\infty}^{\infty} \gamma_j z^j, \qquad [6.1.1]$$

where $z$ denotes a complex scalar. If [6.1.1] is divided by $2\pi$ and evaluated at some $z$ represented by $z = e^{-i\omega}$ for $i = \sqrt{-1}$ and $\omega$ a real scalar, the result is called the population spectrum of $Y$:

$$s_Y(\omega) = \frac{1}{2\pi}\, g_Y(e^{-i\omega}) = \frac{1}{2\pi}\sum_{j=-\infty}^{\infty} \gamma_j e^{-i\omega j}. \qquad [6.1.2]$$

Note that the spectrum is a function of $\omega$: given any particular value of $\omega$ and a sequence of autocovariances $\{\gamma_j\}_{j=-\infty}^{\infty}$, we could in principle calculate the value of $s_Y(\omega)$.

De Moivre's theorem allows us to write $e^{-i\omega j}$ as

$$e^{-i\omega j} = \cos(\omega j) - i\cdot\sin(\omega j). \qquad [6.1.3]$$

Substituting [6.1.3] into [6.1.2], it appears that the spectrum can equivalently be written

$$s_Y(\omega) = \frac{1}{2\pi}\sum_{j=-\infty}^{\infty} \gamma_j[\cos(\omega j) - i\cdot\sin(\omega j)]. \qquad [6.1.4]$$

Note that for a covariance-stationary process, $\gamma_j = \gamma_{-j}$. Hence, [6.1.4] implies

$$s_Y(\omega) = \frac{1}{2\pi}\gamma_0[\cos(0) - i\cdot\sin(0)] + \frac{1}{2\pi}\left\{\sum_{j=1}^{\infty} \gamma_j[\cos(\omega j) + \cos(-\omega j) - i\cdot\sin(\omega j) - i\cdot\sin(-\omega j)]\right\}. \qquad [6.1.5]$$

Next, we make use of the following results from trigonometry:¹

$$\cos(0) = 1$$
$$\sin(0) = 0$$
$$\sin(-\theta) = -\sin(\theta)$$
$$\cos(-\theta) = \cos(\theta).$$

Using these relations, [6.1.5] simplifies to

$$s_Y(\omega) = \frac{1}{2\pi}\left\{\gamma_0 + 2\sum_{j=1}^{\infty} \gamma_j\cos(\omega j)\right\}. \qquad [6.1.6]$$

Assuming that the sequence of autocovariances $\{\gamma_j\}_{j=-\infty}^{\infty}$ is absolutely summable, expression [6.1.6] implies that the population spectrum exists and that $s_Y(\omega)$ is a continuous, real-valued function of $\omega$. It is possible to go a bit further and show that if the $\gamma_j$'s represent autocovariances of a covariance-stationary process, then $s_Y(\omega)$ will be nonnegative for all $\omega$.² Since $\cos(\omega j) = \cos(-\omega j)$ for any $\omega$, the spectrum is symmetric around $\omega = 0$. Finally, since $\cos[(\omega + 2\pi k)\cdot j] = \cos(\omega j)$ for any integers $k$ and $j$, it follows from [6.1.6] that $s_Y(\omega + 2\pi k) = s_Y(\omega)$ for any integer $k$. Hence, the spectrum is a periodic function of $\omega$. If we know the value of $s_Y(\omega)$ for all $\omega$ between 0 and $\pi$, we can infer the value of $s_Y(\omega)$ for any $\omega$.


¹ These are reviewed in Section A.1 of the Mathematical Review (Appendix A) at the end of the book.
² See, for example, Fuller (1976, p. 110).


Calculating the Population Spectrum for Various Processes 
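Let $Y_t$ follow an MA($\infty$) process:

$$Y_t = \mu + \psi(L)\varepsilon_t, \qquad [6.1.7]$$

where

$$\psi(L) = \sum_{j=0}^{\infty} \psi_j L^j$$
$$\sum_{j=0}^{\infty} |\psi_j| < \infty$$
$$E(\varepsilon_t\varepsilon_\tau) = \begin{cases} \sigma^2 & \text{for } t = \tau \\ 0 & \text{otherwise.} \end{cases}$$

Recall from expression [3.6.8] that the autocovariance-generating function for $Y$ is given by

$$g_Y(z) = \sigma^2\psi(z)\psi(z^{-1}).$$

Hence, from [6.1.2], the population spectrum for an MA($\infty$) process is given by

$$s_Y(\omega) = (2\pi)^{-1}\sigma^2\psi(e^{-i\omega})\psi(e^{i\omega}). \qquad [6.1.8]$$

For example, for a white noise process, $\psi(z) = 1$ and the population spectrum is a constant for all $\omega$:

$$s_Y(\omega) = \sigma^2/2\pi. \qquad [6.1.9]$$

Next, consider an MA(1) process:

$$Y_t = \varepsilon_t + \theta\varepsilon_{t-1}.$$

Here, $\psi(z) = 1 + \theta z$ and the population spectrum is

$$s_Y(\omega) = (2\pi)^{-1}\cdot\sigma^2(1 + \theta e^{-i\omega})(1 + \theta e^{i\omega}) = (2\pi)^{-1}\cdot\sigma^2(1 + \theta e^{-i\omega} + \theta e^{i\omega} + \theta^2). \qquad [6.1.10]$$

But notice that

$$e^{-i\omega} + e^{i\omega} = \cos(\omega) - i\cdot\sin(\omega) + \cos(\omega) + i\cdot\sin(\omega) = 2\cdot\cos(\omega), \qquad [6.1.11]$$

so that [6.1.10] becomes

$$s_Y(\omega) = (2\pi)^{-1}\cdot\sigma^2[1 + \theta^2 + 2\theta\cdot\cos(\omega)]. \qquad [6.1.12]$$

Recall that $\cos(\omega)$ goes from 1 to $-1$ as $\omega$ goes from 0 to $\pi$. Hence, when $\theta > 0$, the spectrum $s_Y(\omega)$ is a monotonically decreasing function of $\omega$ for $\omega$ in $[0, \pi]$, whereas when $\theta < 0$, the spectrum is monotonically increasing.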
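As a check on [6.1.12], the sketch below computes the MA(1) spectrum two ways: from the closed form, and from the general cosine formula [6.1.6] using the MA(1) autocovariances $\gamma_0 = (1 + \theta^2)\sigma^2$, $\gamma_1 = \theta\sigma^2$, and $\gamma_j = 0$ for $j > 1$ (a standard result for the MA(1) process). The parameter values and function names are illustrative.

```python
import math


def spectrum_from_autocovariances(omega, gammas):
    """Formula [6.1.6]; gammas[j] holds gamma_j for j = 0, 1, 2, ..."""
    tail = sum(g * math.cos(omega * j) for j, g in enumerate(gammas) if j > 0)
    return (gammas[0] + 2.0 * tail) / (2.0 * math.pi)


def ma1_spectrum(omega, theta, sigma2):
    """Closed form [6.1.12]."""
    return sigma2 * (1.0 + theta ** 2 + 2.0 * theta * math.cos(omega)) / (2.0 * math.pi)


theta, sigma2 = 0.5, 2.0
gammas = [(1.0 + theta ** 2) * sigma2, theta * sigma2]  # gamma_0, gamma_1

omegas = [math.pi * i / 20.0 for i in range(21)]  # grid on [0, pi]
closed_form = [ma1_spectrum(w, theta, sigma2) for w in omegas]
via_cosines = [spectrum_from_autocovariances(w, gammas) for w in omegas]
```

With $\theta > 0$ the values in `closed_form` fall steadily as $\omega$ rises from 0 to $\pi$, matching the monotonicity claim above.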
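For an AR(1) process

$$Y_t = c + \phi Y_{t-1} + \varepsilon_t,$$

we have $\psi(z) = 1/(1 - \phi z)$ as long as $|\phi| < 1$. Thus, the spectrum is

$$s_Y(\omega) = \frac{1}{2\pi}\,\frac{\sigma^2}{(1 - \phi e^{-i\omega})(1 - \phi e^{i\omega})} = \frac{1}{2\pi}\,\frac{\sigma^2}{(1 - \phi e^{-i\omega} - \phi e^{i\omega} + \phi^2)} = \frac{1}{2\pi}\,\frac{\sigma^2}{[1 + \phi^2 - 2\phi\cdot\cos(\omega)]}. \qquad [6.1.13]$$

When $\phi > 0$, the denominator is monotonically increasing in $\omega$ over $[0, \pi]$, meaning that $s_Y(\omega)$ is monotonically decreasing. When $\phi < 0$, the spectrum $s_Y(\omega)$ is a monotonically increasing function of $\omega$.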
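In general, for an ARMA($p$, $q$) process

$$Y_t = c + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \cdots + \phi_p Y_{t-p} + \varepsilon_t + \theta_1\varepsilon_{t-1} + \theta_2\varepsilon_{t-2} + \cdots + \theta_q\varepsilon_{t-q},$$

the population spectrum is given by

$$s_Y(\omega) = \frac{\sigma^2}{2\pi}\,\frac{(1 + \theta_1 e^{-i\omega} + \theta_2 e^{-i2\omega} + \cdots + \theta_q e^{-iq\omega})}{(1 - \phi_1 e^{-i\omega} - \phi_2 e^{-i2\omega} - \cdots - \phi_p e^{-ip\omega})} \times \frac{(1 + \theta_1 e^{i\omega} + \theta_2 e^{i2\omega} + \cdots + \theta_q e^{iq\omega})}{(1 - \phi_1 e^{i\omega} - \phi_2 e^{i2\omega} - \cdots - \phi_p e^{ip\omega})}. \qquad [6.1.14]$$

If the moving average and autoregressive polynomials are factored as follows:

$$1 + \theta_1 z + \theta_2 z^2 + \cdots + \theta_q z^q = (1 - \eta_1 z)(1 - \eta_2 z)\cdots(1 - \eta_q z)$$
$$1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_p z^p = (1 - \lambda_1 z)(1 - \lambda_2 z)\cdots(1 - \lambda_p z),$$

then the spectral density in [6.1.14] can be written

$$s_Y(\omega) = \frac{\sigma^2\prod_{j=1}^{q}[1 + \eta_j^2 - 2\eta_j\cdot\cos(\omega)]}{2\pi\prod_{j=1}^{p}[1 + \lambda_j^2 - 2\lambda_j\cdot\cos(\omega)]}.$$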


Calculating the Autocovariances from the Population Spectrum 


If we know the sequence of autocovariances $\{\gamma_j\}_{j=-\infty}^{\infty}$, in principle we can calculate the value of $s_Y(\omega)$ for any $\omega$ from [6.1.2] or [6.1.6]. The converse is also true: if we know the value of $s_Y(\omega)$ for all $\omega$ in $[0, \pi]$, we can calculate the value of the $k$th autocovariance $\gamma_k$ for any given $k$. This means that the population spectrum $s_Y(\omega)$ and the sequence of autocovariances contain exactly the same information; neither one can tell us anything about the process that is not possible to infer from the other.

The following proposition (proved in Appendix 6.A at the end of this chapter) 
provides a formula for calculating any autocovariance from the population spec- 
trum. 


Proposition 6.1: Let $\{\gamma_j\}_{j=-\infty}^{\infty}$ be an absolutely summable sequence of autocovariances, and define $s_Y(\omega)$ as in [6.1.2]. Then

$$\int_{-\pi}^{\pi} s_Y(\omega)e^{i\omega k}\, d\omega = \gamma_k. \qquad [6.1.15]$$

Result [6.1.15] can equivalently be written as

$$\int_{-\pi}^{\pi} s_Y(\omega)\cos(\omega k)\, d\omega = \gamma_k. \qquad [6.1.16]$$
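Result [6.1.16] can be checked numerically for a process whose autocovariances are known. The sketch below uses the MA(1) spectrum [6.1.12] with illustrative parameter values and a simple midpoint rule; the grid size and function names are arbitrary choices, not part of the proposition.

```python
import math


def ma1_spectrum(omega, theta, sigma2):
    """MA(1) population spectrum, formula [6.1.12]."""
    return sigma2 * (1.0 + theta ** 2 + 2.0 * theta * math.cos(omega)) / (2.0 * math.pi)


def autocovariance_from_spectrum(k, spectrum, n_grid=2000):
    """Midpoint-rule approximation to the integral in [6.1.16] over [-pi, pi]."""
    width = 2.0 * math.pi / n_grid
    total = 0.0
    for i in range(n_grid):
        omega = -math.pi + (i + 0.5) * width
        total += spectrum(omega) * math.cos(omega * k)
    return total * width


theta, sigma2 = 0.5, 2.0
recovered = [
    autocovariance_from_spectrum(k, lambda w: ma1_spectrum(w, theta, sigma2))
    for k in range(3)
]
# For comparison: gamma_0 = (1 + theta^2) * sigma2, gamma_1 = theta * sigma2, gamma_2 = 0.
```

Setting `k = 0` gives the variance, which is the special case taken up next in the text.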


Interpreting the Population Spectrum 


The following result obtains as a special case of Proposition 6.1 by setting $k = 0$:

$$\int_{-\pi}^{\pi} s_Y(\omega)\, d\omega = \gamma_0. \qquad [6.1.17]$$

In other words, the area under the population spectrum between $\pm\pi$ gives $\gamma_0$, the variance of $Y_t$.

More generally, since $s_Y(\omega)$ is nonnegative, if we were to calculate

$$\int_{-\omega_1}^{\omega_1} s_Y(\omega)\, d\omega$$

for any $\omega_1$ between 0 and $\pi$, the result would be a positive number that we could interpret as the portion of the variance of $Y_t$ that is associated with frequencies $\omega$ that are less than $\omega_1$ in absolute value. Recalling that $s_Y(\omega)$ is symmetric, the claim is that

$$2\cdot\int_0^{\omega_1} s_Y(\omega)\, d\omega \qquad [6.1.18]$$

represents the portion of the variance of $Y$ that could be attributed to periodic random components with frequency less than or equal to $\omega_1$.

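What does it mean to attribute a certain portion of the variance of $Y$ to cycles with frequency less than or equal to $\omega_1$? To explore this question, let us consider the following rather special stochastic process. Suppose that the value of $Y$ at date $t$ is determined by

$$Y_t = \sum_{j=1}^{M} [\alpha_j\cdot\cos(\omega_j t) + \delta_j\cdot\sin(\omega_j t)]. \qquad [6.1.19]$$

Here, $\alpha_j$ and $\delta_j$ are zero-mean random variables, meaning that $E(Y_t) = 0$ for all $t$. The sequences $\{\alpha_j\}_{j=1}^{M}$ and $\{\delta_j\}_{j=1}^{M}$ are serially uncorrelated and mutually uncorrelated:

$$E(\alpha_j\alpha_k) = \begin{cases} \sigma_j^2 & \text{for } j = k \\ 0 & \text{for } j \neq k \end{cases}$$
$$E(\delta_j\delta_k) = \begin{cases} \sigma_j^2 & \text{for } j = k \\ 0 & \text{for } j \neq k \end{cases}$$
$$E(\alpha_j\delta_k) = 0 \quad \text{for all } j \text{ and } k.$$

The variance of $Y_t$ is then

$$E(Y_t^2) = \sum_{j=1}^{M} [E(\alpha_j^2)\cdot\cos^2(\omega_j t) + E(\delta_j^2)\cdot\sin^2(\omega_j t)]$$
$$= \sum_{j=1}^{M} \sigma_j^2[\cos^2(\omega_j t) + \sin^2(\omega_j t)] \qquad [6.1.20]$$
$$= \sum_{j=1}^{M} \sigma_j^2,$$

with the last line following from equation [A.1.12]. Thus, for this process, the portion of the variance of $Y$ that is due to cycles of frequency $\omega_j$ is given by $\sigma_j^2$.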


If the frequencies are ordered $0 < \omega_1 < \omega_2 < \cdots < \omega_M < \pi$, the portion of the variance of $Y$ that is due to cycles of frequency less than or equal to $\omega_j$ is given by $\sigma_1^2 + \sigma_2^2 + \cdots + \sigma_j^2$.

The $k$th autocovariance of $Y$ is

$$E(Y_t Y_{t-k}) = \sum_{j=1}^{M} \{E(\alpha_j^2)\cdot\cos(\omega_j t)\cdot\cos[\omega_j(t - k)] + E(\delta_j^2)\cdot\sin(\omega_j t)\cdot\sin[\omega_j(t - k)]\}$$
$$= \sum_{j=1}^{M} \sigma_j^2\{\cos(\omega_j t)\cdot\cos[\omega_j(t - k)] + \sin(\omega_j t)\cdot\sin[\omega_j(t - k)]\}. \qquad [6.1.21]$$

Recall the trigonometric identity³

$$\cos(A - B) = \cos(A)\cdot\cos(B) + \sin(A)\cdot\sin(B). \qquad [6.1.22]$$

For $A = \omega_j t$ and $B = \omega_j(t - k)$, we have $A - B = \omega_j k$, so that [6.1.21] becomes

$$E(Y_t Y_{t-k}) = \sum_{j=1}^{M} \sigma_j^2\cdot\cos(\omega_j k). \qquad [6.1.23]$$

Since the mean and the autocovariances of $Y$ are not functions of time, the process described by [6.1.19] is covariance-stationary, although [6.1.23] implies that the sequence of autocovariances $\{\gamma_k\}_{k=0}^{\infty}$ is not absolutely summable.
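The step from [6.1.21] to [6.1.23] says that the date $t$ drops out of the autocovariance. The sketch below evaluates both expressions for a few dates and lags; the frequencies and variances are made-up values chosen only for illustration.

```python
import math


def autocovariance_at_date(t, k, omegas, sigma2s):
    """Right-hand side of [6.1.21], which nominally depends on the date t."""
    return sum(
        s2 * (math.cos(w * t) * math.cos(w * (t - k)) + math.sin(w * t) * math.sin(w * (t - k)))
        for w, s2 in zip(omegas, sigma2s)
    )


def autocovariance_closed_form(k, omegas, sigma2s):
    """Right-hand side of [6.1.23], a function of the lag k only."""
    return sum(s2 * math.cos(w * k) for w, s2 in zip(omegas, sigma2s))


omegas = [0.3, 1.1, 2.5]   # hypothetical frequencies in (0, pi)
sigma2s = [2.0, 0.5, 1.0]  # hypothetical variances of alpha_j and delta_j

gaps = [
    abs(autocovariance_at_date(t, k, omegas, sigma2s) - autocovariance_closed_form(k, omegas, sigma2s))
    for t in range(1, 6)
    for k in range(4)
]
```

At lag zero the closed form returns the sum of the $\sigma_j^2$, and because the autocovariances are sums of cosines in $k$ they do not die out as $k$ grows, consistent with the remark that they are not absolutely summable.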

We were able to attribute a certain portion of the variance of $Y_t$ to cycles of less than a given frequency for the process in [6.1.19] because that is a rather special covariance-stationary process. However, there is a general result known as the spectral representation theorem which says that any covariance-stationary process $Y_t$ can be expressed in terms of a generalization of [6.1.19]. For any fixed frequency $\omega$ in $[0, \pi]$, we define random variables $\alpha(\omega)$ and $\delta(\omega)$ and propose to write a stationary process with absolutely summable autocovariances in the form

$$Y_t = \mu + \int_0^{\pi} [\alpha(\omega)\cdot\cos(\omega t) + \delta(\omega)\cdot\sin(\omega t)]\, d\omega.$$

The random processes represented by $\alpha(\cdot)$ and $\delta(\cdot)$ have zero mean and the further properties that for any frequencies $0 < \omega_1 < \omega_2 < \omega_3 < \omega_4 < \pi$, the variable $\int_{\omega_1}^{\omega_2} \alpha(\omega)\, d\omega$ is uncorrelated with $\int_{\omega_3}^{\omega_4} \alpha(\omega)\, d\omega$ and the variable $\int_{\omega_1}^{\omega_2} \delta(\omega)\, d\omega$ is uncorrelated with $\int_{\omega_3}^{\omega_4} \delta(\omega)\, d\omega$, while for any $0 < \omega_1 < \omega_2 < \pi$ and $0 < \omega_3 < \omega_4 < \pi$, the variable $\int_{\omega_1}^{\omega_2} \alpha(\omega)\, d\omega$ is uncorrelated with $\int_{\omega_3}^{\omega_4} \delta(\omega)\, d\omega$. For such a process, one can calculate the portion of the variance of $Y_t$ that is due to cycles with frequency less than or equal to some specified value $\omega_1$ through a generalization of the procedure used to analyze [6.1.19]. Moreover, this magnitude turns out to be given by the expression in [6.1.18].

We shall not attempt a proof of the spectral representation theorem here; for details the reader is referred to Cramér and Leadbetter (1967, pp. 128-38). Instead, the next section provides a formal derivation of a finite-sample version of these results, showing the sense in which the sample analog of [6.1.18] gives the portion of the sample variance of an observed series that can be attributed to cycles with frequencies less than or equal to $\omega_1$.


³ See, for example, Thomas (1972, p. 176).


6.2. The Sample Periodogram 


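For a covariance-stationary process $Y_t$ with absolutely summable autocovariances, we have defined the value of the population spectrum at frequency $\omega$ to be

$$s_Y(\omega) = \frac{1}{2\pi}\sum_{j=-\infty}^{\infty} \gamma_j e^{-i\omega j}, \qquad [6.2.1]$$

where

$$\gamma_j = E(Y_t - \mu)(Y_{t-j} - \mu)$$

and $\mu = E(Y_t)$. Note that the population spectrum is expressed in terms of $\{\gamma_j\}_{j=0}^{\infty}$, which represents population second moments.

Given an observed sample of $T$ observations denoted $y_1, y_2, \ldots, y_T$, we can calculate up to $T - 1$ sample autocovariances from the formulas

$$\hat{\gamma}_j = \begin{cases} T^{-1}\sum_{t=j+1}^{T} (y_t - \bar{y})(y_{t-j} - \bar{y}) & \text{for } j = 0, 1, 2, \ldots, T - 1 \\ \hat{\gamma}_{-j} & \text{for } j = -1, -2, \ldots, -T + 1, \end{cases} \qquad [6.2.2]$$

where $\bar{y}$ is the sample mean:

$$\bar{y} = T^{-1}\sum_{t=1}^{T} y_t. \qquad [6.2.3]$$

For any given $\omega$ we can then construct the sample analog of [6.2.1], which is known as the sample periodogram:

$$\hat{s}_y(\omega) = \frac{1}{2\pi}\sum_{j=-T+1}^{T-1} \hat{\gamma}_j e^{-i\omega j}. \qquad [6.2.4]$$

As in [6.1.6], the sample periodogram can equivalently be expressed as

$$\hat{s}_y(\omega) = \frac{1}{2\pi}\left[\hat{\gamma}_0 + 2\sum_{j=1}^{T-1} \hat{\gamma}_j\cos(\omega j)\right]. \qquad [6.2.5]$$

The same calculations that led to [6.1.17] can be used to show that the area under the periodogram is the sample variance of $y$:

$$\int_{-\pi}^{\pi} \hat{s}_y(\omega)\, d\omega = \hat{\gamma}_0.$$

Like the population spectrum, the sample periodogram is symmetric around $\omega = 0$, so that we could equivalently write

$$\hat{\gamma}_0 = 2\int_0^{\pi} \hat{s}_y(\omega)\, d\omega.$$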
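The formulas [6.2.2] and [6.2.5] translate directly into code. The sketch below computes the sample autocovariances and the periodogram for a short made-up series and checks the area property by a midpoint rule; the data, the grid size, and the function names are illustrative assumptions.

```python
import math


def sample_autocovariances(y):
    """Sample autocovariances gamma_hat_j for j = 0, ..., T-1, following [6.2.2]."""
    T = len(y)
    y_bar = sum(y) / T
    return [
        sum((y[t] - y_bar) * (y[t - j] - y_bar) for t in range(j, T)) / T
        for j in range(T)
    ]


def periodogram(omega, gamma_hat):
    """Sample periodogram in the cosine form [6.2.5]."""
    tail = sum(g * math.cos(omega * j) for j, g in enumerate(gamma_hat) if j > 0)
    return (gamma_hat[0] + 2.0 * tail) / (2.0 * math.pi)


def area_under_periodogram(gamma_hat, n_grid=1000):
    """Midpoint-rule approximation to the integral of the periodogram over [-pi, pi]."""
    width = 2.0 * math.pi / n_grid
    return width * sum(
        periodogram(-math.pi + (i + 0.5) * width, gamma_hat) for i in range(n_grid)
    )


y = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 3.0]  # hypothetical sample, T = 7
gamma_hat = sample_autocovariances(y)
area = area_under_periodogram(gamma_hat)
```

The list index `t` runs from `j` to `T - 1` because Python lists start at zero; this corresponds to dates $t = j + 1, \ldots, T$ in [6.2.2].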


There also turns out to be a sample analog to the spectral representation theorem, which we now develop. In particular, we will see that given any $T$ observations on a process $(y_1, y_2, \ldots, y_T)$, there exist frequencies $\omega_1, \omega_2, \ldots, \omega_M$ and coefficients $\hat{\mu}, \hat{\alpha}_1, \hat{\alpha}_2, \ldots, \hat{\alpha}_M, \hat{\delta}_1, \hat{\delta}_2, \ldots, \hat{\delta}_M$ such that the value for $y$ at date $t$ can be expressed as

$$y_t = \hat{\mu} + \sum_{j=1}^{M} \{\hat{\alpha}_j\cdot\cos[\omega_j(t - 1)] + \hat{\delta}_j\cdot\sin[\omega_j(t - 1)]\}, \qquad [6.2.6]$$

where the variable $\hat{\alpha}_j\cdot\cos[\omega_j(t - 1)]$ is orthogonal in the sample to $\hat{\alpha}_k\cdot\cos[\omega_k(t - 1)]$ for $j \neq k$, the variable $\hat{\delta}_j\cdot\sin[\omega_j(t - 1)]$ is orthogonal to $\hat{\delta}_k\cdot\sin[\omega_k(t - 1)]$ for $j \neq k$, and the variable $\hat{\alpha}_j\cdot\cos[\omega_j(t - 1)]$ is orthogonal to $\hat{\delta}_k\cdot\sin[\omega_k(t - 1)]$ for all $j$ and $k$. The sample variance of $y$ is $T^{-1}\sum_{t=1}^{T}(y_t - \bar{y})^2$, and the portion of this variance that can be attributed to cycles with frequency $\omega_j$ can be inferred from the sample periodogram $\hat{s}_y(\omega_j)$.

We will develop this claim for the case when the sample size $T$ is an odd
number. In this case $y_t$ will be expressed in terms of periodic functions with
$M = (T - 1)/2$ different frequencies in [6.2.6]. The frequencies $\omega_1, \omega_2, \ldots, \omega_M$
are specified as follows:

$$ \begin{aligned} \omega_1 &= 2\pi/T \\ \omega_2 &= 4\pi/T \\ &\;\;\vdots \\ \omega_M &= 2M\pi/T. \end{aligned} \qquad [6.2.7] $$

Thus, the highest frequency considered is

$$ \omega_M = \frac{2(T-1)\pi}{2T} < \pi. $$


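Consider an OLS regression of the value of $y_t$ on a constant and on the various
cosine and sine terms,

$$ y_t = \mu + \sum_{j=1}^{M} \{\alpha_j \cdot \cos[\omega_j(t-1)] + \delta_j \cdot \sin[\omega_j(t-1)]\} + u_t. $$

This can be viewed as a standard regression model of the form

$$ y_t = \boldsymbol{\beta}' \mathbf{x}_t + u_t, \qquad [6.2.8] $$

where

$$ \mathbf{x}_t = \begin{bmatrix} 1 & \cos[\omega_1(t-1)] & \sin[\omega_1(t-1)] & \cos[\omega_2(t-1)] & \sin[\omega_2(t-1)] & \cdots & \cos[\omega_M(t-1)] & \sin[\omega_M(t-1)] \end{bmatrix}' \qquad [6.2.9] $$

$$ \boldsymbol{\beta}' = \begin{bmatrix} \mu & \alpha_1 & \delta_1 & \alpha_2 & \delta_2 & \cdots & \alpha_M & \delta_M \end{bmatrix}. \qquad [6.2.10] $$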


Note that $\mathbf{x}_t$ has $(2M + 1) = T$ elements, so that there are as many explanatory
variables as observations. We will show that the elements of $\mathbf{x}_t$ are linearly
independent, meaning that an OLS regression of $y_t$ on $\mathbf{x}_t$ yields a perfect fit. Thus, the
fitted values for this regression are of the form of [6.2.6] with no error term $u_t$.
Moreover, the coefficients of this regression have the property that $\tfrac{1}{2}(\hat{\alpha}_j^2 + \hat{\delta}_j^2)$
represents the portion of the sample variance of $y$ that can be attributed to cycles
with frequency $\omega_j$. This magnitude $\tfrac{1}{2}(\hat{\alpha}_j^2 + \hat{\delta}_j^2)$ further turns out to be proportional
to the sample periodogram evaluated at $\omega_j$. In other words, any observed series
$y_1, y_2, \ldots, y_T$ can be expressed in terms of periodic functions as in [6.2.6], and
the portion of the sample variance that is due to cycles with frequency $\omega_j$ can be
found from the sample periodogram. These points are established formally in the
following proposition, which is proved in Appendix 6.A at the end of this chapter.




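Proposition 6.2: Let $T$ denote an odd integer and let $M = (T - 1)/2$. Let
$\omega_j = 2\pi j/T$ for $j = 1, 2, \ldots, M$, and let $\mathbf{x}_t$ be the $(T \times 1)$ vector in [6.2.9]. Then

$$ \sum_{t=1}^{T} \mathbf{x}_t \mathbf{x}_t' = \begin{bmatrix} T & \mathbf{0}' \\ \mathbf{0} & (T/2) \cdot \mathbf{I}_{T-1} \end{bmatrix}. \qquad [6.2.11] $$

Furthermore, let $\{y_1, y_2, \ldots, y_T\}$ be any $T$ numbers. Then the following are true:

(a) The value of $y_t$ can be expressed as

$$ y_t = \hat{\mu} + \sum_{j=1}^{M} \{\hat{\alpha}_j \cdot \cos[\omega_j(t-1)] + \hat{\delta}_j \cdot \sin[\omega_j(t-1)]\}, $$

with $\hat{\mu} = \bar{y}$ (the sample mean from [6.2.3]) and

$$ \hat{\alpha}_j = (2/T) \sum_{t=1}^{T} y_t \cdot \cos[\omega_j(t-1)] \quad \text{for } j = 1, 2, \ldots, M \qquad [6.2.12] $$

$$ \hat{\delta}_j = (2/T) \sum_{t=1}^{T} y_t \cdot \sin[\omega_j(t-1)] \quad \text{for } j = 1, 2, \ldots, M. \qquad [6.2.13] $$

(b) The sample variance of $y_t$ can be expressed as

$$ (1/T) \sum_{t=1}^{T} (y_t - \bar{y})^2 = (1/2) \sum_{j=1}^{M} (\hat{\alpha}_j^2 + \hat{\delta}_j^2), \qquad [6.2.14] $$

and the portion of the sample variance of $y$ that can be attributed to cycles of
frequency $\omega_j$ is given by $\tfrac{1}{2}(\hat{\alpha}_j^2 + \hat{\delta}_j^2)$.

(c) The portion of the sample variance of $y$ that can be attributed to cycles of
frequency $\omega_j$ can equivalently be expressed as

$$ (1/2)(\hat{\alpha}_j^2 + \hat{\delta}_j^2) = (4\pi/T) \cdot \hat{s}_y(\omega_j), \qquad [6.2.15] $$

where $\hat{s}_y(\omega_j)$ is the sample periodogram at frequency $\omega_j$.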


Result [6.2.11] establishes that $\sum_{t=1}^{T} \mathbf{x}_t \mathbf{x}_t'$ is a diagonal matrix, meaning that
the explanatory variables contained in $\mathbf{x}_t$ are mutually orthogonal. The proposition
asserts that any observed time series $(y_1, y_2, \ldots, y_T)$ with $T$ odd can be written
as a constant plus a weighted sum of $(T - 1)$ periodic functions with $(T - 1)/2$
different frequencies; a related result can also be developed when $T$ is an even
integer. Hence, the proposition gives a finite-sample analog of the spectral
representation theorem. The proposition further shows that the sample periodogram
captures the portion of the sample variance of $y$ that can be attributed to cycles
of different frequencies.
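Parts (a) and (b) of Proposition 6.2 can be illustrated with a short computation. The sketch below uses plain Python on made-up data with $T = 7$; the function names are hypothetical. It computes the coefficients from [6.2.12] and [6.2.13], rebuilds each observation from [6.2.6], and compares the sample variance with the sum in [6.2.14].

```python
import math

def fourier_regression(y):
    """Coefficients of [6.2.6] from formulas [6.2.12]-[6.2.13]; T must be odd."""
    T = len(y)
    if T % 2 == 0:
        raise ValueError("this sketch covers only odd T")
    M = (T - 1) // 2
    omegas = [2 * math.pi * j / T for j in range(1, M + 1)]
    mu = sum(y) / T
    # y[t] holds the observation for date t+1, so (date - 1) equals the list index t.
    alpha = [(2 / T) * sum(y[t] * math.cos(w * t) for t in range(T)) for w in omegas]
    delta = [(2 / T) * sum(y[t] * math.sin(w * t) for t in range(T)) for w in omegas]
    return mu, omegas, alpha, delta

def fitted_values(y):
    """Right-hand side of [6.2.6] evaluated at every date in the sample."""
    mu, omegas, alpha, delta = fourier_regression(y)
    return [
        mu + sum(a * math.cos(w * t) + d * math.sin(w * t)
                 for w, a, d in zip(omegas, alpha, delta))
        for t in range(len(y))
    ]

y = [2.0, 3.5, 1.0, 4.0, 2.5, 3.0, 0.5]  # hypothetical data, T = 7
mu, omegas, alpha, delta = fourier_regression(y)
sample_variance = sum((v - mu) ** 2 for v in y) / len(y)
explained = [0.5 * (a * a + d * d) for a, d in zip(alpha, delta)]
max_fit_error = max(abs(f - v) for f, v in zip(fitted_values(y), y))
print(max_fit_error, sample_variance, sum(explained))
```

The fit error should be at the level of rounding noise, and the entries of `explained` give the share of the sample variance assigned to each of the $M = 3$ frequencies.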

Note that the frequencies $\omega_j$ in terms of which the variance of $y$ is explained
all lie in $[0, \pi]$. Why aren't negative frequencies $\omega < 0$ employed as well? Suppose
that the data were actually generated by a special case of the process in [6.1.19],

$$ Y_t = \alpha \cdot \cos(-\omega t) + \delta \cdot \sin(-\omega t), \qquad [6.2.16] $$

where $-\omega < 0$ represents some particular negative frequency and where $\alpha$ and $\delta$
are zero-mean random variables. Since $\cos(-\omega t) = \cos(\omega t)$ and $\sin(-\omega t) = -\sin(\omega t)$,
the process [6.2.16] can equivalently be written

$$ Y_t = \alpha \cdot \cos(\omega t) - \delta \cdot \sin(\omega t). \qquad [6.2.17] $$


Thus there is no way of using observed data on $y$ to decide whether the data are
generated by a cycle with frequency $-\omega$ as in [6.2.16] or by a cycle with frequency
$+\omega$ as in [6.2.17]. It is simply a matter of convention that we choose to focus only
on positive frequencies.

FIGURE 6.1 Aliasing: plots of $\cos[(\pi/2)t]$ and $\cos[(3\pi/2)t]$ as functions of $t$.

Why is $\omega = \pi$ the largest frequency considered? Suppose the data were
generated from a periodic function with frequency $\omega > \pi$, say, $\omega = 3\pi/2$ for
illustration:

$$ Y_t = \alpha \cdot \cos[(3\pi/2)t] + \delta \cdot \sin[(3\pi/2)t]. \qquad [6.2.18] $$

Again, the properties of the sine and cosine function imply that [6.2.18] is equivalent
to

$$ Y_t = \alpha \cdot \cos[(-\pi/2)t] + \delta \cdot \sin[(-\pi/2)t]. \qquad [6.2.19] $$

Thus, by the previous argument, a representation with cycles of frequency $(3\pi/2)$
is observationally indistinguishable from one with cycles of frequency $(\pi/2)$.

To summarize, if the data-generating process actually includes cycles with
negative frequencies or with frequencies greater than $\pi$, these will be imputed to
cycles with frequencies between 0 and $\pi$. This is known as aliasing.

Another way to think about aliasing is as follows. Recall that the value of
the function $\cos(\omega t)$ repeats itself every $2\pi/\omega$ periods, so that a frequency of $\omega$ is
associated with a period of $2\pi/\omega$.⁴ We have argued that the highest-frequency cycle
that one can observe is $\omega = \pi$. Another way to express this conclusion is that the
shortest-period cycle that one can observe is one that repeats itself every $2\pi/\pi = 2$
periods. If $\omega = 3\pi/2$, the cycle repeats itself every $4/3$ periods. But if the data are
observed only at integer dates, the sampled data will exhibit cycles that are repeated
every four periods, corresponding to the frequency $\omega = \pi/2$. This is illustrated in
Figure 6.1, which plots $\cos[(\pi/2)t]$ and $\cos[(3\pi/2)t]$ as functions of $t$. When sampled
at integer values of $t$, these two functions appear identical. Even though the function
$\cos[(3\pi/2)t]$ repeats itself every time that $t$ increases by $4/3$, one would have to observe
$y_t$ at four distinct dates $(y_t, y_{t+1}, y_{t+2}, y_{t+3})$ before one would see the value of
$\cos[(3\pi/2)t]$ repeat itself for an integer value of $t$.
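The aliasing argument is easy to confirm numerically. This small sketch, with hypothetical function names, evaluates the two cosine functions of Figure 6.1 at integer dates and at one date between integers.

```python
import math

def slow(t):
    """Cycle with frequency pi/2 (period 4)."""
    return math.cos((math.pi / 2) * t)

def fast(t):
    """Cycle with frequency 3*pi/2 (period 4/3)."""
    return math.cos((3 * math.pi / 2) * t)

# At integer dates the two cycles coincide up to rounding error.
gaps_at_integer_dates = [abs(slow(t) - fast(t)) for t in range(12)]
# Between integer dates they differ visibly.
gap_between_dates = abs(slow(0.5) - fast(0.5))
print(max(gaps_at_integer_dates), gap_between_dates)
```

A sample observed only at integer dates therefore carries no information that could separate the two frequencies.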


⁴See Section A.1 of the Mathematical Review (Appendix A) at the end of the book for a further
discussion of this point.




Note that in a particular finite sample, the lowest frequency used to account
for variation in $y$ is $\omega_1 = 2\pi/T$, which corresponds to a period of $T$. If a cycle takes
longer than $T$ periods to repeat itself, there is not much that one could infer about
it if one has only $T$ observations available.

Result (c) of Proposition 6.2 indicates that the portion of the sample variance
of $y$ that can be attributed to cycles of frequency $\omega_j$ is proportional to the sample
periodogram evaluated at $\omega_j$, with $4\pi/T$ the constant of proportionality. Thus, the
proposition develops the formal basis for the claim that the sample periodogram
reflects the portion of the sample variance of $y$ that can be attributed to cycles of
various frequencies.

Why is the constant of proportionality in [6.2.15] equal to $4\pi/T$? The population
spectrum $s_Y(\omega)$ could be evaluated at any $\omega$ in the continuous set of points
between 0 and $\pi$. In this respect it is much like a probability density $f_X(x)$, where
$X$ is a continuous random variable. Although we might loosely think of the value
of $f_X(x)$ as the “probability” that $X = x$, it is more accurate to say that the integral
$\int_{x_1}^{x_2} f_X(x)\, dx$ represents the probability that $X$ takes on a value between $x_1$ and $x_2$.
As $x_2 - x_1$ becomes smaller, the probability that $X$ will be observed to lie between
$x_1$ and $x_2$ becomes smaller, and the probability that $X$ would take on precisely the
value $x$ is effectively equal to zero. In just the same way, although we can loosely
think of the value of $s_Y(\omega)$ as the contribution that cycles with frequency $\omega$ make
to the variance of $Y$, it is more accurate to say that the integral

$$ \int_{-\omega_1}^{\omega_1} s_Y(\omega)\, d\omega = \int_{0}^{\omega_1} 2 s_Y(\omega)\, d\omega $$

represents the contribution that cycles of frequency less than or equal to $\omega_1$ make
to the variance of $Y$, and that $\int_{\omega_1}^{\omega_2} 2 s_Y(\omega)\, d\omega$ represents the contribution that cycles
with frequencies between $\omega_1$ and $\omega_2$ make to the variance of $Y$. Assuming that
$s_Y(\omega)$ is continuous, the contribution that a cycle of any particular frequency $\omega$
makes is technically zero.

Although the population spectrum $s_Y(\omega)$ is defined at any $\omega$ in $[0, \pi]$, the
representation in [6.2.6] attributes all of the sample variance of $y$ to the particular
frequencies $\omega_1, \omega_2, \ldots, \omega_M$. Any variation in $Y$ that is in reality due to cycles
with frequencies other than these $M$ particular values is attributed by [6.2.6] to
one of these $M$ frequencies. If we are thinking of the regression in [6.2.6]
as telling us something about the population spectrum, we should interpret
$\tfrac{1}{2}(\hat{\alpha}_j^2 + \hat{\delta}_j^2)$ not as the portion of the variance of $Y$ that is due to cycles with
frequency exactly equal to $\omega_j$, but rather as the portion of the variance of $Y$ that
is due to cycles with frequency near $\omega_j$. Thus [6.2.15] is not an estimate of the
height of the population spectrum, but an estimate of the area under the population
spectrum.

This is illustrated in Figure 6.2. Suppose we thought of $\tfrac{1}{2}(\hat{\alpha}_j^2 + \hat{\delta}_j^2)$ as an
estimate of the portion of the variance of $Y$ that is due to cycles with frequency
between $\omega_{j-1}$ and $\omega_j$, that is, an estimate of 2 times the area under $s_Y(\omega)$ between
$\omega_{j-1}$ and $\omega_j$. Since $\omega_j = 2\pi j/T$, the difference $\omega_j - \omega_{j-1}$ is equal to $2\pi/T$. If $\hat{s}_y(\omega_j)$
is an estimate of $s_Y(\omega_j)$, then the area under $s_Y(\omega)$ between $\omega_{j-1}$ and $\omega_j$ could be
approximately estimated by the area of a rectangle with width $2\pi/T$ and height
$\hat{s}_y(\omega_j)$. The area of such a rectangle is $(2\pi/T) \cdot \hat{s}_y(\omega_j)$. Since $\tfrac{1}{2}(\hat{\alpha}_j^2 + \hat{\delta}_j^2)$ is an estimate
of 2 times the area under $s_Y(\omega)$ between $\omega_{j-1}$ and $\omega_j$, we have $\tfrac{1}{2}(\hat{\alpha}_j^2 + \hat{\delta}_j^2) =
(4\pi/T) \cdot \hat{s}_y(\omega_j)$, as claimed in equation [6.2.15].

Proposition 6.2 also provides a convenient formula for calculating the value of
the sample periodogram at frequency $\omega_j = 2\pi j/T$ for $j = 1, 2, \ldots, (T - 1)/2$,
namely,

$$ \hat{s}_y(\omega_j) = [T/(8\pi)](\hat{\alpha}_j^2 + \hat{\delta}_j^2), $$

where

$$ \hat{\alpha}_j = (2/T) \sum_{t=1}^{T} y_t \cdot \cos[\omega_j(t-1)] $$

$$ \hat{\delta}_j = (2/T) \sum_{t=1}^{T} y_t \cdot \sin[\omega_j(t-1)]. $$

That is,

$$ \hat{s}_y(\omega_j) = \frac{1}{2\pi T} \left\{ \left[ \sum_{t=1}^{T} y_t \cdot \cos[\omega_j(t-1)] \right]^2 + \left[ \sum_{t=1}^{T} y_t \cdot \sin[\omega_j(t-1)] \right]^2 \right\}. $$

FIGURE 6.2 The area under the sample periodogram and the portion of the
variance of $y$ attributable to cycles of different frequencies.
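As an illustration, the direct formula can be compared with the autocovariance-based form [6.2.5] at the frequencies $\omega_j = 2\pi j/T$. This is a minimal sketch in plain Python on made-up data; the function names are hypothetical.

```python
import math

def periodogram_from_autocovariances(y, omega):
    """Sample periodogram via the cosine form [6.2.5]."""
    T = len(y)
    ybar = sum(y) / T
    gam = [sum((y[t] - ybar) * (y[t - j] - ybar) for t in range(j, T)) / T
           for j in range(T)]
    tail = sum(gam[j] * math.cos(omega * j) for j in range(1, T))
    return (gam[0] + 2 * tail) / (2 * math.pi)

def periodogram_direct(y, j):
    """Sample periodogram at omega_j = 2*pi*j/T from the sums of y*cos and y*sin."""
    T = len(y)
    w = 2 * math.pi * j / T
    # List index t stands for (date - 1).
    c = sum(y[t] * math.cos(w * t) for t in range(T))
    s = sum(y[t] * math.sin(w * t) for t in range(T))
    return (c * c + s * s) / (2 * math.pi * T)

y = [2.0, 3.5, 1.0, 4.0, 2.5, 3.0, 0.5]  # hypothetical data, T = 7
for j in (1, 2, 3):
    w = 2 * math.pi * j / len(y)
    print(j, periodogram_direct(y, j), periodogram_from_autocovariances(y, w))
```

The direct form needs only two sums per frequency, whereas the autocovariance form first requires all $T$ sample autocovariances.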


6.3. Estimating the Population Spectrum 


Section 6.1 introduced the population spectrum $s_Y(\omega)$, which indicates the portion
of the population variance of $Y$ that can be attributed to cycles of frequency $\omega$.
This section addresses the following question: Given an observed sample
$\{y_1, y_2, \ldots, y_T\}$, how might $s_Y(\omega)$ be estimated?


Large-Sample Properties of the Sample Periodogram


One obvious approach would be to estimate the population spectrum $s_Y(\omega)$
by the sample periodogram $\hat{s}_y(\omega)$. However, this approach turns out to have some
serious limitations. Suppose that

$$ Y_t = \sum_{j=0}^{\infty} \psi_j \varepsilon_{t-j}, $$

where $\{\psi_j\}_{j=0}^{\infty}$ is absolutely summable and where $\{\varepsilon_t\}_{t=-\infty}^{\infty}$ is an i.i.d. sequence with
$E(\varepsilon_t) = 0$ and $E(\varepsilon_t^2) = \sigma^2$. Let $s_Y(\omega)$ be the population spectrum defined in [6.1.2],
and suppose that $s_Y(\omega) > 0$ for all $\omega$. Let $\hat{s}_y(\omega)$ be the sample periodogram defined
in [6.2.4]. Fuller (1976, p. 280) showed that for $\omega \neq 0$ and a sufficiently large
sample size $T$, twice the ratio of the sample periodogram to the population spectrum
has approximately the following distribution:


$$ \frac{2 \cdot \hat{s}_y(\omega)}{s_Y(\omega)} \approx \chi^2(2). \qquad [6.3.1] $$

Moreover, if $\lambda \neq \omega$, the quantity

$$ \frac{2 \cdot \hat{s}_y(\lambda)}{s_Y(\lambda)} \qquad [6.3.2] $$

also has an approximate $\chi^2(2)$ distribution, with the variable in [6.3.1] approximately
independent of that in [6.3.2].
Since a $\chi^2(2)$ variable has a mean of 2, result [6.3.1] suggests that

$$ E\left[ \frac{2 \cdot \hat{s}_y(\omega)}{s_Y(\omega)} \right] \cong 2, $$

or since $s_Y(\omega)$ is a population magnitude rather than a random variable,

$$ E[\hat{s}_y(\omega)] \cong s_Y(\omega). $$

Thus, if the sample size is sufficiently large, the sample periodogram affords an
approximately unbiased estimate of the population spectrum.

Note from Table B.2 that 95% of the time, a $\chi^2(2)$ variable will fall between
0.05 and 7.4. Thus, from [6.3.1], $\hat{s}_y(\omega)$ is unlikely to be as small as 0.025 times the
true value of $s_Y(\omega)$, and $\hat{s}_y(\omega)$ is unlikely to be any larger than 3.7 times as big as
$s_Y(\omega)$. Given such a large confidence interval, we would have to say that $\hat{s}_y(\omega)$ is
not an altogether satisfactory estimate of $s_Y(\omega)$.
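The bounds quoted from Table B.2 can be reproduced without a table. The sketch below relies on a standard fact that is not stated in the text: a $\chi^2(2)$ variable has distribution function $1 - e^{-x/2}$, so its quantiles have a closed form. The function name is hypothetical.

```python
import math

def chi2_2_quantile(p):
    """Quantile of chi-square(2): the CDF is 1 - exp(-x/2), so x = -2*log(1 - p)."""
    return -2.0 * math.log(1.0 - p)

lower = chi2_2_quantile(0.025)  # 2.5% point
upper = chi2_2_quantile(0.975)  # 97.5% point
# Dividing by 2 gives the implied bounds on the ratio of the periodogram
# to the population spectrum in [6.3.1].
print(lower, upper, lower / 2, upper / 2)
```

The printed values should round to the 0.05 and 7.4 quoted above, with ratio bounds near 0.025 and 3.7.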

Another feature of result [6.3.1] is that the estimate $\hat{s}_y(\omega)$ is not getting any
more accurate as the sample size $T$ increases. Typically, one expects an econometric
estimate to get better and better as the sample size grows. For example, the variance
for the sample autocorrelation coefficient $\hat{\rho}_j$ given in [4.8.8] goes to zero as $T \to \infty$,
so that given a sufficiently large sample, we would be able to infer the true
value of $\rho_j$ with virtual certainty. The estimate $\hat{s}_y(\omega)$ defined in [6.2.4] does not
have this property, because we have tried to estimate as many parameters
$(\gamma_0, \gamma_1, \ldots, \gamma_{T-1})$ as we had observations $(y_1, y_2, \ldots, y_T)$.


Parametric Estimates of the Population Spectrum 


Suppose we believe that the data could be represented with an ARMA($p$, $q$)
model,

$$ Y_t = \mu + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \cdots + \phi_p Y_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \cdots + \theta_q \varepsilon_{t-q}, \qquad [6.3.3] $$

where $\varepsilon_t$ is white noise with variance $\sigma^2$. Then an excellent approach to estimating
the population spectrum is first to estimate the parameters $\mu, \phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q$,
and $\sigma^2$ by maximum likelihood as described in the previous chapter. The
maximum likelihood estimates $(\hat{\phi}_1, \ldots, \hat{\phi}_p, \hat{\theta}_1, \ldots, \hat{\theta}_q, \hat{\sigma}^2)$ could then be plugged
into a formula such as [6.1.14] to estimate the population spectrum $s_Y(\omega)$ at any
frequency $\omega$. If the model is correctly specified, the maximum likelihood estimates
$(\hat{\phi}_1, \ldots, \hat{\phi}_p, \hat{\theta}_1, \ldots, \hat{\theta}_q, \hat{\sigma}^2)$ will get closer and closer to the true values as the
sample size grows; hence, the resulting estimate of the population spectrum should
have this same property.

Even if the model is incorrectly specified, if the autocovariances of the true 
process are reasonably close to those for an ARMA(p, q) specification, this pro- 
cedure should provide a useful estimate of the population spectrum. 
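The plug-in step can be sketched as follows. The code assumes the standard ARMA spectrum $s_Y(\omega) = (\sigma^2/2\pi)\,|\theta(e^{-i\omega})|^2 / |\phi(e^{-i\omega})|^2$, which is presumably what [6.1.14] states (that equation is not reproduced in this section). The function name and parameter values are hypothetical, and the maximum likelihood estimation itself is not shown.

```python
import cmath
import math

def arma_spectrum(omega, phi, theta, sigma2):
    """Spectrum of an ARMA(p, q) process under the assumed standard formula.

    phi   -- AR coefficients [phi_1, ..., phi_p]
    theta -- MA coefficients [theta_1, ..., theta_q]
    """
    z = cmath.exp(-1j * omega)
    ar = 1 - sum(p * z ** (k + 1) for k, p in enumerate(phi))
    ma = 1 + sum(c * z ** (k + 1) for k, c in enumerate(theta))
    return sigma2 / (2 * math.pi) * abs(ma) ** 2 / abs(ar) ** 2

# Hypothetical estimates standing in for maximum likelihood output.
phi_hat, theta_hat, sigma2_hat = [0.5], [0.3], 1.0
for omega in (0.0, math.pi / 2, math.pi):
    print(omega, arma_spectrum(omega, phi_hat, theta_hat, sigma2_hat))
```

With empty coefficient lists the function returns the flat white noise spectrum $\sigma^2/(2\pi)$ at every frequency.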


Nonparametric Estimates of the Population Spectrum 


The assumption in [6.3.3] is that $Y_t$ can be reasonably approximated by an
ARMA($p$, $q$) process with $p$ and $q$ small. An alternative assumption is that $s_Y(\omega)$
will be close to $s_Y(\lambda)$ when $\omega$ is close to $\lambda$. This assumption forms the basis for
another class of estimates of the population spectrum known as nonparametric or
kernel estimates.

If $s_Y(\omega)$ is close to $s_Y(\lambda)$ when $\omega$ is close to $\lambda$, this suggests that $s_Y(\omega)$ might
be estimated with a weighted average of the values of $\hat{s}_y(\lambda)$ for values of $\lambda$ in a
neighborhood around $\omega$, where the weights depend on the distance between $\omega$ and
$\lambda$. Let $\hat{s}_Y(\omega)$ denote such an estimate of $s_Y(\omega)$ and let $\omega_j = 2\pi j/T$. The suggestion
is to take

$$ \hat{s}_Y(\omega_j) = \sum_{m=-h}^{h} \kappa(\omega_{j+m}, \omega_j) \cdot \hat{s}_y(\omega_{j+m}). \qquad [6.3.4] $$


Here, $h$ is a bandwidth parameter indicating how many different frequencies
$\{\omega_{j \pm 1}, \omega_{j \pm 2}, \ldots, \omega_{j \pm h}\}$ are viewed as useful for estimating $s_Y(\omega_j)$. The kernel
$\kappa(\omega_{j+m}, \omega_j)$ indicates how much weight each frequency is to be given. The kernel
weights sum to unity:

$$ \sum_{m=-h}^{h} \kappa(\omega_{j+m}, \omega_j) = 1. $$


One approach is to take $\kappa(\omega_{j+m}, \omega_j)$ to be proportional to $h + 1 - |m|$. One
can show that⁵

$$ \sum_{m=-h}^{h} [h + 1 - |m|] = (h + 1)^2. $$

Hence, in order to satisfy the property that the weights sum to unity, the proposed
kernel is

$$ \kappa(\omega_{j+m}, \omega_j) = \frac{h + 1 - |m|}{(h + 1)^2} \qquad [6.3.5] $$


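⁵Notice that

$$ \begin{aligned} \sum_{m=-h}^{h} [h + 1 - |m|] &= \sum_{m=-h}^{h} (h + 1) - \sum_{m=-h}^{h} |m| \\ &= (h + 1) \sum_{m=-h}^{h} 1 - 2 \sum_{s=0}^{h} s \\ &= (2h + 1)(h + 1) - 2h(h + 1)/2 \\ &= (h + 1)^2. \end{aligned} $$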




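and the estimator [6.3.4] becomes

$$ \hat{s}_Y(\omega_j) = \sum_{m=-h}^{h} \left[ \frac{h + 1 - |m|}{(h + 1)^2} \right] \hat{s}_y(\omega_{j+m}). \qquad [6.3.6] $$

For example, for $h = 2$, this is

$$ \hat{s}_Y(\omega_j) = \tfrac{1}{9}\hat{s}_y(\omega_{j-2}) + \tfrac{2}{9}\hat{s}_y(\omega_{j-1}) + \tfrac{3}{9}\hat{s}_y(\omega_j) + \tfrac{2}{9}\hat{s}_y(\omega_{j+1}) + \tfrac{1}{9}\hat{s}_y(\omega_{j+2}). $$

Recall from [6.3.1] and [6.3.2] that the estimates $\hat{s}_y(\omega)$ and $\hat{s}_y(\lambda)$ are approximately
independent in large samples for $\omega \neq \lambda$. Because the kernel estimate averages
over a number of different frequencies, it should give a much better estimate than
does the periodogram.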
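The weights in [6.3.5] and the average in [6.3.6] can be sketched in a few lines. The function names and the periodogram ordinates below are made up, and the sketch handles only interior positions, where $h$ ordinates are available on each side.

```python
def triangular_weights(h):
    """Kernel weights (h + 1 - |m|) / (h + 1)^2 for m = -h, ..., h, as in [6.3.5]."""
    return [(h + 1 - abs(m)) / (h + 1) ** 2 for m in range(-h, h + 1)]

def smoothed_value(periodogram_values, j, h):
    """Kernel estimate [6.3.6] at list position j (interior positions only)."""
    if j - h < 0 or j + h >= len(periodogram_values):
        raise ValueError("need h periodogram ordinates on each side of position j")
    weights = triangular_weights(h)
    return sum(w * periodogram_values[j + m]
               for w, m in zip(weights, range(-h, h + 1)))

ordinates = [4.0, 1.0, 7.0, 2.0, 6.0, 3.0, 5.0]  # hypothetical periodogram values
print(triangular_weights(2))
print(smoothed_value(ordinates, 3, 2))
```

For $h = 2$ the weights are the fractions 1/9, 2/9, 3/9, 2/9, 1/9 from the example above.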

Averaging $\hat{s}_y(\omega)$ over different frequencies can equivalently be represented
as multiplying the $j$th sample autocovariance $\hat{\gamma}_j$ for $j > 0$ in the formula for the
sample periodogram [6.2.5] by a weight $\kappa_j^*$. For example, consider an estimate of
the spectrum at frequency $\omega$ that is obtained by taking a simple average of the
value of $\hat{s}_y(\lambda)$ for $\lambda$ between $\omega - \nu$ and $\omega + \nu$:

$$ \hat{s}_Y(\omega) = (2\nu)^{-1} \int_{\omega - \nu}^{\omega + \nu} \hat{s}_y(\lambda)\, d\lambda. \qquad [6.3.7] $$


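Substituting [6.2.5] into [6.3.7], such an estimate could equivalently be expressed
as

$$ \begin{aligned} \hat{s}_Y(\omega) &= (4\nu\pi)^{-1} \int_{\omega - \nu}^{\omega + \nu} \left[ \hat{\gamma}_0 + 2 \sum_{j=1}^{T-1} \hat{\gamma}_j \cos(\lambda j) \right] d\lambda \\ &= (4\nu\pi)^{-1} (2\nu) \hat{\gamma}_0 + (2\nu\pi)^{-1} \sum_{j=1}^{T-1} \hat{\gamma}_j (1/j) \cdot [\sin(\lambda j)]_{\lambda = \omega - \nu}^{\omega + \nu} \\ &= (2\pi)^{-1} \hat{\gamma}_0 + (2\nu\pi)^{-1} \sum_{j=1}^{T-1} \hat{\gamma}_j (1/j) \cdot \{\sin[(\omega + \nu)j] - \sin[(\omega - \nu)j]\}. \end{aligned} \qquad [6.3.8] $$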


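Using the trigonometric identity⁶

$$ \sin(A + B) - \sin(A - B) = 2 \cdot \cos(A) \cdot \sin(B), \qquad [6.3.9] $$

expression [6.3.8] can be written

$$ \begin{aligned} \hat{s}_Y(\omega) &= (2\pi)^{-1} \hat{\gamma}_0 + (2\nu\pi)^{-1} \sum_{j=1}^{T-1} \hat{\gamma}_j (1/j) \cdot [2 \cdot \cos(\omega j) \cdot \sin(\nu j)] \\ &= (2\pi)^{-1} \left\{ \hat{\gamma}_0 + 2 \sum_{j=1}^{T-1} \left[ \frac{\sin(\nu j)}{\nu j} \right] \hat{\gamma}_j \cos(\omega j) \right\}. \end{aligned} \qquad [6.3.10] $$

Notice that expression [6.3.10] is of the following form:

$$ \hat{s}_Y(\omega) = (2\pi)^{-1} \left\{ \hat{\gamma}_0 + 2 \sum_{j=1}^{T-1} \kappa_j^* \hat{\gamma}_j \cos(\omega j) \right\}, \qquad [6.3.11] $$

where

$$ \kappa_j^* = \frac{\sin(\nu j)}{\nu j}. \qquad [6.3.12] $$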


The sample periodogram can be regarded as a special case of [6.3.11] when $\kappa_j^* = 1$.
Expression [6.3.12] cannot exceed 1 in absolute value, and so the estimate [6.3.11]
essentially downweights $\hat{\gamma}_j$ relative to the sample periodogram.


⁶See, for example, Thomas (1972, pp. 174-75).




Recall that $\sin(\pi j) = 0$ for any integer $j$. Hence, if $\nu = \pi$, then $\kappa_j^* = 0$ for
all $j$ and [6.3.11] becomes

$$ \hat{s}_Y(\omega) = (2\pi)^{-1} \hat{\gamma}_0. \qquad [6.3.13] $$

In this case, all autocovariances other than $\hat{\gamma}_0$ would be shrunk to zero. When $\nu = \pi$,
the estimate [6.3.7] is an unweighted average of $\hat{s}_y(\lambda)$ over all possible values
of $\lambda$, and the resulting estimate would be the flat spectrum for a white noise process.
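The behavior of the weights in [6.3.12] can be illustrated numerically. This is a small sketch with a hypothetical function name: it checks that the weight never exceeds 1 in absolute value, that it vanishes at every integer $j$ when $\nu = \pi$, and that it approaches 1 as $\nu$ shrinks toward zero, which corresponds to no averaging at all.

```python
import math

def averaging_weight(j, nu):
    """Weight kappa*_j = sin(nu*j) / (nu*j) implied by averaging over [w - nu, w + nu]."""
    return math.sin(nu * j) / (nu * j)

weights_narrow = [averaging_weight(j, 0.3) for j in range(1, 21)]     # nu = 0.3
weights_full = [averaging_weight(j, math.pi) for j in range(1, 6)]    # nu = pi
weight_tiny_band = averaging_weight(1, 1e-6)                          # nu near zero
print(weights_narrow[:4], weights_full, weight_tiny_band)
```

A wider averaging band $\nu$ thus shrinks the higher-order autocovariances more aggressively.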

Specification of a kernel function $\kappa(\omega_{j+m}, \omega_j)$ in [6.3.4] can equivalently be
described in terms of a weighting sequence $\{\kappa_j^*\}_{j=1}^{T-1}$ in [6.3.11]. Because they are
just two different representations for the same idea, the weight $\kappa_j^*$ is also sometimes
called a kernel. Smaller values of $\kappa_j^*$ impose more smoothness on the spectrum.
Smoothing schemes may be chosen either because they provide a convenient
specification for $\kappa(\omega_{j+m}, \omega_j)$ or because they provide a convenient specification for $\kappa_j^*$.

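One popular estimate of the spectrum employs the modified Bartlett kernel,
which is given by

$$ \kappa_j^* = \begin{cases} 1 - \dfrac{j}{q + 1} & \text{for } j = 1, 2, \ldots, q \\ 0 & \text{for } j > q. \end{cases} \qquad [6.3.14] $$

The Bartlett estimate of the spectrum is thus

$$ \hat{s}_Y(\omega) = (2\pi)^{-1} \left\{ \hat{\gamma}_0 + 2 \sum_{j=1}^{q} [1 - j/(q + 1)] \hat{\gamma}_j \cos(\omega j) \right\}. \qquad [6.3.15] $$

Autocovariances $\hat{\gamma}_j$ for $j > q$ are treated as if they were zero, or as if $Y_t$ followed
an MA($q$) process. For $j \leq q$, the estimated autocovariances $\hat{\gamma}_j$ are shrunk toward
zero, with the shrinkage greater the larger the value of $j$.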
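A direct transcription of [6.3.14] and [6.3.15] might look like the sketch below. The function names and the autocovariance values are hypothetical; in practice the list would hold sample autocovariances computed from [6.2.2].

```python
import math

def bartlett_weight(j, q):
    """Modified Bartlett weight [6.3.14]: 1 - j/(q+1) for j <= q, zero beyond q."""
    return 1.0 - j / (q + 1) if j <= q else 0.0

def bartlett_spectrum(gammas, omega, q):
    """Bartlett estimate [6.3.15]; gammas[j] is the j-th sample autocovariance."""
    if q >= len(gammas):
        raise ValueError("q must be smaller than the number of autocovariances")
    tail = sum(bartlett_weight(j, q) * gammas[j] * math.cos(omega * j)
               for j in range(1, q + 1))
    return (gammas[0] + 2.0 * tail) / (2.0 * math.pi)

gammas = [1.0, 0.5, 0.2, 0.1]  # hypothetical sample autocovariances
print([bartlett_weight(j, 2) for j in range(1, 4)])
print(bartlett_spectrum(gammas, 0.0, 2))
```

Setting $q = 0$ drops every autocovariance except $\hat{\gamma}_0$ and returns the flat estimate $\hat{\gamma}_0/(2\pi)$.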

How is one to choose the bandwidth parameter $h$ in [6.3.6] or $q$ in [6.3.15]?
The periodogram itself is asymptotically unbiased but has a large variance. If one 
constructs an estimate based on averaging the periodogram at different frequencies, 
this reduces the variance but introduces some bias. The severity of the bias depends 
on the steepness of the population spectrum and the size of the bandwidth. One 
practical guide is to plot an estimate of the spectrum using several different band- 
widths and rely on subjective judgment to choose the bandwidth that produces the 
most plausible estimate. 


6.4. Uses of Spectral Analysis 


We illustrate some of the uses of spectral analysis with data on manufacturing 
production in the United States. The data are plotted in Figure 6.3. The series is 
the Federal Reserve Board’s seasonally unadjusted monthly index from January 
1947 to November 1989. Economic recessions in 1949, 1954, 1958, 1960, 1970, 
1974, 1980, and 1982 appear as roughly year-long episodes of falling production. 
There are also strong seasonal patterns in this series; for example, production 
almost always declines in July and recovers in August. 

The sample periodogram for the raw data is plotted in Figure 6.4, which
displays $\hat{s}_y(\omega_j)$ as a function of $j$, where $\omega_j = 2\pi j/T$. The contribution to the sample
variance of the lowest-frequency components ($j$ near zero) is several orders of
magnitude larger than the contributions of economic recessions or the seasonal
factors. This is due to the clear upward trend of the series in Figure 6.3. Let $y_t$
represent the series plotted in Figure 6.3. If one were trying to describe this with
a sine function

$$ y_t = \delta \cdot \sin(\omega t), $$

the presumption would have to be that $\omega$ is so small that even at date $t = T$ the
magnitude $\omega T$ would still be less than $\pi/2$. Figure 6.4 thus indicates that the trend
or low-frequency components are by far the most important determinants of the
sample variance of $y$.

FIGURE 6.3 Federal Reserve Board's seasonally unadjusted index of industrial
production for U.S. manufacturing, monthly 1947:1 to 1989:11.

FIGURE 6.4 Sample periodogram for the data plotted in Figure 6.3. The figure
plots $\hat{s}_y(\omega_j)$ as a function of $j$, where $\omega_j = 2\pi j/T$.

FIGURE 6.5 Estimate of the spectrum for monthly growth rate of industrial
production, or spectrum of 100 times the first difference of the log of the series in
Figure 6.3.

The definition of the population spectrum in equation [6.1.2] assumed that
the process is covariance-stationary, which is not a good assumption for the data
in Figure 6.3. We might instead try to analyze the monthly growth rate defined by

$$ x_t = 100 \cdot [\log(y_t) - \log(y_{t-1})]. \qquad [6.4.1] $$

Figure 6.5 plots the estimate of the population spectrum of $X$ as described in
equation [6.3.6] with $h = 12$.

In interpreting a plot such as Figure 6.5 it is often more convenient to think
in terms of the period of a cyclic function rather than its frequency. Recall that if
the frequency of a cycle is $\omega$, the period of the cycle is $2\pi/\omega$. Thus, a frequency
of $\omega_j = 2\pi j/T$ corresponds to a period of $2\pi/\omega_j = T/j$. The sample size is $T = 513$
observations, and the first peak in Figure 6.5 occurs around $j = 18$. This corresponds
to a cycle with a period of $513/18 = 28.5$ months, or about $2\tfrac{1}{2}$ years. Given the
dates of the economic recessions noted previously, this is sometimes described as
a “business cycle frequency,” and the area under this hill might be viewed as telling
us how much of the variability in monthly growth rates is due to economic
recessions.




The second peak in Figure 6.5 occurs at j = 44 and corresponds to a period 
of 513/44 = 11.7 months. This is natural to view as a 12-month cycle associated 
with seasonal effects. The four subsequent peaks correspond to cycles with periods 
of 6, 4, 3, and 2.4 months, respectively, and again seem likely to be picking up 
seasonal and calendar effects. 
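The conversion from the index $j$ to a period in months is simple arithmetic, sketched here with a hypothetical helper. The last line inverts the relation to show where cycles of 6, 4, 3, and 2.4 months would fall on the horizontal axis of Figure 6.5.

```python
def period_in_months(T, j):
    """Period 2*pi/omega_j = T/j implied by the frequency omega_j = 2*pi*j/T."""
    return T / j

T = 513  # sample size reported in the text
first_peak = period_in_months(T, 18)
second_peak = period_in_months(T, 44)
# Index j at which a cycle of the given period (in months) would appear.
seasonal_indices = {months: T / months for months in (12, 6, 4, 3, 2.4)}
print(first_peak, second_peak, seasonal_indices)
```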

Since manufacturing typically falls temporarily in July, the growth rate is
negative in July and positive in August. This induces negative first-order serial
correlation to the series in [6.4.1] and a variety of calendar patterns for $x_t$ that may
account for the high-frequency peaks in Figure 6.5. An alternative strategy for
detrending would use year-to-year growth rates, or the percentage change between
$y_t$ and its value for the corresponding month in the previous year:

$$ w_t = 100 \cdot [\log(y_t) - \log(y_{t-12})]. \qquad [6.4.2] $$

The estimate of the sample spectrum for this series is plotted in Figure 6.6.
When the data are detrended in this way, virtually all the variance that remains is
attributed to components associated with the business cycle frequencies.


Filters 


Apart from the scale parameter, the monthly growth rate $x_t$ in [6.4.1] is
obtained from $\log(y_t)$ by applying the filter

$$ x_t = (1 - L) \log(y_t), \qquad [6.4.3] $$

where $L$ is the lag operator. To discuss such transformations in general terms, let
$Y_t$ be any covariance-stationary series with absolutely summable autocovariances.
Denote the autocovariance-generating function of $Y$ by $g_Y(z)$, and denote the
population spectrum of $Y$ by $s_Y(\omega)$. Recall that

$$ s_Y(\omega) = (2\pi)^{-1} g_Y(e^{-i\omega}). \qquad [6.4.4] $$

FIGURE 6.6 Estimate of the spectrum for year-to-year growth rate of monthly
industrial production, or spectrum of 100 times the seasonal difference of the log
of the series in Figure 6.3.
Suppose we transform $Y$ according to

$$ X_t = h(L) Y_t, $$

where

$$ h(L) = \sum_{j=-\infty}^{\infty} h_j L^j $$

and

$$ \sum_{j=-\infty}^{\infty} |h_j| < \infty. $$

Recall from equation [3.6.17] that the autocovariance-generating function of $X$ can
be calculated from the autocovariance-generating function of $Y$ using the formula

$$ g_X(z) = h(z) h(z^{-1}) g_Y(z). \qquad [6.4.5] $$

The population spectrum of $X$ is thus

$$ s_X(\omega) = (2\pi)^{-1} g_X(e^{-i\omega}) = (2\pi)^{-1} h(e^{-i\omega}) h(e^{i\omega}) g_Y(e^{-i\omega}). \qquad [6.4.6] $$

Substituting [6.4.4] into [6.4.6] reveals that the population spectrum of $X$ is related
to the population spectrum of $Y$ according to

$$ s_X(\omega) = h(e^{-i\omega}) h(e^{i\omega}) s_Y(\omega). \qquad [6.4.7] $$

Operating on a series $Y_t$ with the filter $h(L)$ has the effect of multiplying the
spectrum by the function $h(e^{-i\omega}) h(e^{i\omega})$.


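For the difference operator in [6.4.3], the filter is $h(L) = 1 - L$ and the
function $h(e^{-i\omega}) h(e^{i\omega})$ would be

$$ \begin{aligned} h(e^{-i\omega}) h(e^{i\omega}) &= (1 - e^{-i\omega})(1 - e^{i\omega}) \\ &= 1 - e^{-i\omega} - e^{i\omega} + 1 \\ &= 2 - 2 \cdot \cos(\omega), \end{aligned} \qquad [6.4.8] $$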


where the last line follows from [6.1.11]. If $X_t = (1 - L) Y_t$, then, to find the
value of the population spectrum of $X$ at any frequency $\omega$, we first find the value
of the population spectrum of $Y$ at $\omega$ and then multiply by $2 - 2 \cdot \cos(\omega)$. For
example, the spectrum at frequency $\omega = 0$ is multiplied by zero, the spectrum at
frequency $\omega = \pi/2$ is multiplied by 2, and the spectrum at frequency $\omega = \pi$ is
multiplied by 4. Differencing the data removes the low-frequency components and
accentuates the high-frequency components.
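The multiplier $h(e^{-i\omega}) h(e^{i\omega})$ can be evaluated for any filter with real coefficients. The sketch below represents a filter as a dictionary from lag to coefficient, which is a choice made here for illustration, and compares the result for $h(L) = 1 - L$ with the closed form in [6.4.8].

```python
import cmath
import math

def filter_gain(coefficients, omega):
    """Return h(e^{-i w}) h(e^{i w}) for a real filter given as {lag: coefficient}."""
    h = sum(c * cmath.exp(-1j * omega * lag) for lag, c in coefficients.items())
    # For real coefficients h(e^{i w}) is the complex conjugate of h(e^{-i w}).
    return (h * h.conjugate()).real

first_difference = {0: 1.0, 1: -1.0}  # h(L) = 1 - L
for omega in (0.0, math.pi / 2, math.pi):
    print(omega, filter_gain(first_difference, omega), 2 - 2 * math.cos(omega))
```

The three printed multipliers should be close to 0, 2, and 4, matching the examples in the text.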

Of course, this calculation assumes that the original process $Y_t$ is covariance-
stationary, so that $s_Y(\omega)$ exists. If the original process is nonstationary, as appears
to be the case in Figure 6.3, the differenced data $(1 - L) Y_t$ in general would not
have a population spectrum that is zero at frequency zero.

The seasonal difference filter used in [6.4.2] is $h(L) = 1 - L^{12}$, for which

$$ \begin{aligned} h(e^{-i\omega}) h(e^{i\omega}) &= (1 - e^{-12i\omega})(1 - e^{12i\omega}) \\ &= 1 - e^{-12i\omega} - e^{12i\omega} + 1 \\ &= 2 - 2 \cdot \cos(12\omega). \end{aligned} $$

This function is equal to zero when $12\omega = 0, 2\pi, 4\pi, 6\pi, 8\pi, 10\pi$, or $12\pi$; that
is, it is zero at frequencies $\omega = 0, 2\pi/12, 4\pi/12, 6\pi/12, 8\pi/12, 10\pi/12$, and $\pi$.
Thus, seasonally differencing not only eliminates the low-frequency ($\omega = 0$)
components of a stationary process, but further eliminates any contribution from cycles
with periods of 12, 6, 4, 3, 2.4, or 2 months.
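The zeros of the seasonal-difference multiplier and the periods they correspond to can be listed with a short computation. The names below are hypothetical.

```python
import math

def seasonal_gain(omega):
    """Multiplier 2 - 2*cos(12*omega) for the seasonal difference filter 1 - L^12."""
    return 2 - 2 * math.cos(12 * omega)

# Frequencies omega = 2*pi*k/12 for k = 0, 1, ..., 6 and the periods 2*pi/omega.
zero_frequencies = [2 * math.pi * k / 12 for k in range(7)]
gains_at_zeros = [seasonal_gain(w) for w in zero_frequencies]
periods = [12 / k for k in range(1, 7)]
gain_between_zeros = seasonal_gain(math.pi / 12)  # halfway between the first two zeros
print(gains_at_zeros, periods, gain_between_zeros)
```

Between the zeros the multiplier rises as high as 4, so the filter amplifies some non-seasonal frequencies while removing the seasonal ones.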


Composite Stochastic Processes 


Let $X_t$ be covariance-stationary with absolutely summable autocovariances,
autocovariance-generating function $g_X(z)$, and population spectrum $s_X(\omega)$. Let $W_t$
be a different covariance-stationary series with absolutely summable
autocovariances, autocovariance-generating function $g_W(z)$, and population spectrum $s_W(\omega)$,
where $X_t$ is uncorrelated with $W_\tau$ for all $t$ and $\tau$. Suppose we observe the sum of
these two processes,

$$ Y_t = X_t + W_t. $$

Recall from [4.7.19] that the autocovariance-generating function of the sum is the
sum of the autocovariance-generating functions:

$$ g_Y(z) = g_X(z) + g_W(z). $$

It follows from [6.1.2] that the spectrum of the sum is the sum of the spectra:

$$ s_Y(\omega) = s_X(\omega) + s_W(\omega). \qquad [6.4.9] $$

For example, if a white noise series $W_t$ with variance $\sigma^2$ is added to a series $X_t$ and
if $X_t$ is uncorrelated with $W_\tau$ for all $t$ and $\tau$, the effect is to shift the population
spectrum everywhere up by the constant $\sigma^2/(2\pi)$. More generally, if $X$ has a peak
in its spectrum at frequency $\omega_1$ and if $W$ has a peak in its spectrum at $\omega_2$, then
typically the sum $X + W$ will have peaks at both $\omega_1$ and $\omega_2$.

As another example, suppose that

$$ Y_t = c + \sum_{j=-\infty}^{\infty} h_j X_{t-j} + \varepsilon_t, $$

where $X_t$ is covariance-stationary with absolutely summable autocovariances and
spectrum $s_X(\omega)$. Suppose that the sequence $\{h_j\}_{j=-\infty}^{\infty}$ is absolutely summable and
that $\varepsilon_t$ is a white noise process with variance $\sigma^2$ where $\varepsilon$ is uncorrelated with $X$ at
all leads and lags. It follows from [6.4.7] that the random variable $\sum_{j=-\infty}^{\infty} h_j X_{t-j}$
has spectrum $h(e^{-i\omega}) h(e^{i\omega}) s_X(\omega)$, and so, from [6.4.9], the spectrum of $Y$ is

$$ s_Y(\omega) = h(e^{-i\omega}) h(e^{i\omega}) s_X(\omega) + \sigma^2/(2\pi). $$


APPENDIX 6.A. Proofs of Chapter 6 Propositions 
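■ Proof of Proposition 6.1. Notice that

$$ \begin{aligned} \int_{-\pi}^{\pi} s_Y(\omega) e^{i\omega k}\, d\omega &= \frac{1}{2\pi} \int_{-\pi}^{\pi} \sum_{j=-\infty}^{\infty} \gamma_j e^{-i\omega j} e^{i\omega k}\, d\omega \\ &= \frac{1}{2\pi} \sum_{j=-\infty}^{\infty} \gamma_j \int_{-\pi}^{\pi} e^{i\omega(k-j)}\, d\omega \\ &= \frac{1}{2\pi} \sum_{j=-\infty}^{\infty} \gamma_j \int_{-\pi}^{\pi} \{\cos[\omega(k-j)] + i \cdot \sin[\omega(k-j)]\}\, d\omega. \end{aligned} \qquad [6.A.1] $$

Consider the integral in [6.A.1]. For $k = j$, this would be

$$ \begin{aligned} \int_{-\pi}^{\pi} \{\cos[\omega(k-j)] + i \cdot \sin[\omega(k-j)]\}\, d\omega &= \int_{-\pi}^{\pi} \{\cos(0) + i \cdot \sin(0)\}\, d\omega \\ &= \int_{-\pi}^{\pi} d\omega \\ &= 2\pi. \end{aligned} \qquad [6.A.2] $$

For $k \neq j$, the integral in [6.A.1] would be

$$ \begin{aligned} \int_{-\pi}^{\pi} \{\cos[\omega(k-j)] + i \cdot \sin[\omega(k-j)]\}\, d\omega &= \left. \frac{\sin[\omega(k-j)]}{k-j} \right|_{\omega=-\pi}^{\pi} - \left. i \, \frac{\cos[\omega(k-j)]}{k-j} \right|_{\omega=-\pi}^{\pi} \\ &= (k-j)^{-1} \{\sin[\pi(k-j)] - \sin[-\pi(k-j)] \\ &\qquad - i \cdot \cos[\pi(k-j)] + i \cdot \cos[-\pi(k-j)]\}. \end{aligned} \qquad [6.A.3] $$

But the difference between the frequencies $\pi(k-j)$ and $-\pi(k-j)$ is $2\pi(k-j)$, which
is an integer multiple of $2\pi$. Since the sine and cosine functions are periodic, the magnitude
in [6.A.3] is zero. Hence, only the term for $j = k$ in the sum in [6.A.1] is nonzero, and
using [6.A.2], this sum is seen to be

$$ \int_{-\pi}^{\pi} s_Y(\omega) e^{i\omega k}\, d\omega = \frac{1}{2\pi} \gamma_k \int_{-\pi}^{\pi} [\cos(0) + i \cdot \sin(0)]\, d\omega = \gamma_k, $$

as claimed in [6.1.15].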
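To derive [6.1.16], notice that since $s_Y(\omega)$ is symmetric around $\omega = 0$,

$$ \begin{aligned} \int_{-\pi}^{\pi} s_Y(\omega) e^{i\omega k}\, d\omega &= \int_{-\pi}^{0} s_Y(\omega) e^{i\omega k}\, d\omega + \int_{0}^{\pi} s_Y(\omega) e^{i\omega k}\, d\omega \\ &= \int_{0}^{\pi} s_Y(-\omega) e^{-i\omega k}\, d\omega + \int_{0}^{\pi} s_Y(\omega) e^{i\omega k}\, d\omega \\ &= \int_{0}^{\pi} s_Y(\omega)(e^{-i\omega k} + e^{i\omega k})\, d\omega \\ &= \int_{0}^{\pi} s_Y(\omega) \cdot 2 \cdot \cos(\omega k)\, d\omega, \end{aligned} $$

where the last line follows from [6.1.11]. Again appealing to the symmetry of $s_Y(\omega)$,

$$ \int_{0}^{\pi} s_Y(\omega) \cdot 2 \cdot \cos(\omega k)\, d\omega = \int_{-\pi}^{\pi} s_Y(\omega) \cos(\omega k)\, d\omega, $$

so that

$$ \int_{-\pi}^{\pi} s_Y(\omega) e^{i\omega k}\, d\omega = \int_{-\pi}^{\pi} s_Y(\omega) \cos(\omega k)\, d\omega, $$

as claimed.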
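The inversion result [6.1.15] can be illustrated numerically for an MA(1) process. The sketch assumes the standard MA(1) spectrum $s_Y(\omega) = (\sigma^2/2\pi)(1 + \theta^2 + 2\theta\cos\omega)$ and the autocovariances $\gamma_0 = (1 + \theta^2)\sigma^2$, $\gamma_1 = \theta\sigma^2$, $\gamma_k = 0$ for $k > 1$, none of which are restated in this appendix. The function names and parameter values are hypothetical.

```python
import cmath
import math

def ma1_spectrum(omega, theta, sigma2):
    """Population spectrum of an MA(1) process (standard formula, assumed here)."""
    return sigma2 / (2 * math.pi) * (1 + theta ** 2 + 2 * theta * math.cos(omega))

def autocovariance_from_spectrum(k, spectrum, n=256):
    """Approximate the integral of s(w) * exp(i*w*k) over [-pi, pi) by a Riemann sum."""
    step = 2 * math.pi / n
    total = sum(spectrum(-math.pi + m * step) * cmath.exp(1j * (-math.pi + m * step) * k)
                for m in range(n))
    return (step * total).real

theta, sigma2 = 0.6, 2.0  # hypothetical parameter values
recovered = [autocovariance_from_spectrum(k, lambda w: ma1_spectrum(w, theta, sigma2))
             for k in range(3)]
print(recovered)  # compare with (1 + theta^2)*sigma2, theta*sigma2, and 0
```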


■ Derivation of Equation [6.2.11] in Proposition 6.2. We begin by establishing the following
result:

$$ \sum_{t=1}^{T} \exp[i(2\pi s/T)(t-1)] = \begin{cases} T & \text{for } s = 0 \\ 0 & \text{for } s = \pm 1, \pm 2, \ldots, \pm(T-1). \end{cases} \qquad [6.A.4] $$

That [6.A.4] holds for $s = 0$ is an immediate consequence of the fact that $\exp(0) = 1$. To
see that it holds for the other cases in [6.A.4], define

$$ z \equiv \exp[i(2\pi s/T)]. \qquad [6.A.5] $$




Then the expression to be evaluated in [6.A.4] can be written

$$ \sum_{t=1}^{T} \exp[i(2\pi s/T)(t-1)] = \sum_{t=1}^{T} z^{(t-1)}. \qquad [6.A.6] $$

We now show that for any $N$,

$$ \sum_{t=1}^{N} z^{(t-1)} = \frac{1 - z^N}{1 - z}, \qquad [6.A.7] $$

provided that $z \neq 1$, which is the case whenever $0 < |s| < T$. Expression [6.A.7] can be
verified by induction. Clearly, it holds for $N = 1$, for then

$$ \sum_{t=1}^{N} z^{(t-1)} = z^{(0)} = 1. $$

Given that [6.A.7] holds for $N$, we see that

$$ \begin{aligned} \sum_{t=1}^{N+1} z^{(t-1)} &= \sum_{t=1}^{N} z^{(t-1)} + z^N \\ &= \frac{1 - z^N}{1 - z} + z^N \\ &= \frac{1 - z^N + z^N(1 - z)}{1 - z} \\ &= \frac{1 - z^{N+1}}{1 - z}, \end{aligned} $$

as claimed in [6.A.7].
Setting $N = T$ in [6.A.7] and substituting the result into [6.A.6], we see that

$$ \sum_{t=1}^{T} \exp[i(2\pi s/T)(t-1)] = \frac{1 - z^T}{1 - z} \qquad [6.A.8] $$

for $0 < |s| < T$. But it follows from the definition of $z$ in [6.A.5] that

$$ \begin{aligned} z^T &= \exp[i(2\pi s/T) \cdot T] \\ &= \exp[i(2\pi s)] \\ &= \cos(2\pi s) + i \cdot \sin(2\pi s) \\ &= 1 \quad \text{for } s = \pm 1, \pm 2, \ldots, \pm(T-1). \end{aligned} \qquad [6.A.9] $$

Substituting [6.A.9] into [6.A.8] produces

$$ \sum_{t=1}^{T} \exp[i(2\pi s/T)(t-1)] = 0 \quad \text{for } s = \pm 1, \pm 2, \ldots, \pm(T-1), $$

as claimed in [6.A.4].
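Result [6.A.4] is straightforward to confirm for a particular $T$. The sketch below uses a hypothetical function name and the value $T = 7$ chosen only for illustration.

```python
import cmath
import math

def exponential_sum(s, T):
    """Sum over t = 1..T of exp[i*(2*pi*s/T)*(t - 1)], the left side of [6.A.4]."""
    return sum(cmath.exp(1j * (2 * math.pi * s / T) * (t - 1)) for t in range(1, T + 1))

T = 7
sum_at_zero = exponential_sum(0, T)
largest_other = max(abs(exponential_sum(s, T))
                    for s in range(-(T - 1), T) if s != 0)
print(sum_at_zero, largest_other)
```

The sum should equal $T$ at $s = 0$ and be indistinguishable from zero for every other $s$ between $-(T-1)$ and $T-1$.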
To see how [6.A.4] can be used to deduce expression [6.2.11], notice that the first column of $\sum_{t=1}^{T} \mathbf{x}_t\mathbf{x}_t'$ is given by

$$\begin{bmatrix} T \\ \sum \cos[\omega_1(t-1)] \\ \sum \sin[\omega_1(t-1)] \\ \vdots \\ \sum \cos[\omega_M(t-1)] \\ \sum \sin[\omega_M(t-1)] \end{bmatrix}, \tag{6.A.10}$$

where $\sum$ indicates summation over $t$ from 1 to $T$. The first row of $\sum_{t=1}^{T} \mathbf{x}_t\mathbf{x}_t'$ is the transpose of [6.A.10]. To show that all the terms in [6.A.10] other than the first element are zero,


174 Chapter 6 | Spectral Analysis 


we must show that

$$\sum_{t=1}^{T} \cos[\omega_j(t-1)] = 0 \qquad \text{for } j = 1, 2, \ldots, M \tag{6.A.11}$$

and

$$\sum_{t=1}^{T} \sin[\omega_j(t-1)] = 0 \qquad \text{for } j = 1, 2, \ldots, M \tag{6.A.12}$$

for $\omega_j$ the frequencies specified in [6.2.7]. But [6.A.4] establishes that

$$\begin{aligned}
0 &= \sum_{t=1}^{T} \exp[i(2\pi j/T)(t-1)] \\
&= \sum_{t=1}^{T} \cos[(2\pi j/T)(t-1)] + i\cdot\sum_{t=1}^{T} \sin[(2\pi j/T)(t-1)]
\end{aligned} \tag{6.A.13}$$

for $j = 1, 2, \ldots, M$. For [6.A.13] to equal zero, both the real and the imaginary component must equal zero. Since $\omega_j = 2\pi j/T$, results [6.A.11] and [6.A.12] follow immediately from [6.A.13].

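Result [6.A.4] can also be used to calculate the other elements of $\sum_{t=1}^{T} \mathbf{x}_t\mathbf{x}_t'$. To see how, note that

$$\tfrac{1}{2}[e^{i\theta} + e^{-i\theta}] = \tfrac{1}{2}[\cos(\theta) + i\cdot\sin(\theta) + \cos(\theta) - i\cdot\sin(\theta)] = \cos(\theta) \tag{6.A.14}$$

and similarly

$$\frac{1}{2i}[e^{i\theta} - e^{-i\theta}] = \frac{1}{2i}\{\cos(\theta) + i\cdot\sin(\theta) - [\cos(\theta) - i\cdot\sin(\theta)]\} = \sin(\theta). \tag{6.A.15}$$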


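Thus, for example, the elements of $\sum_{t=1}^{T} \mathbf{x}_t\mathbf{x}_t'$ corresponding to products of the cosine terms can be calculated as

$$\begin{aligned}
\sum_{t=1}^{T} &\cos[\omega_j(t-1)]\cdot\cos[\omega_k(t-1)] \\
&= \frac{1}{4}\sum_{t=1}^{T} \{\exp[i\omega_j(t-1)] + \exp[-i\omega_j(t-1)]\} \times \{\exp[i\omega_k(t-1)] + \exp[-i\omega_k(t-1)]\} \\
&= \frac{1}{4}\sum_{t=1}^{T} \{\exp[i(\omega_j + \omega_k)(t-1)] + \exp[i(-\omega_j + \omega_k)(t-1)] \\
&\qquad\qquad + \exp[i(\omega_j - \omega_k)(t-1)] + \exp[i(-\omega_j - \omega_k)(t-1)]\} \\
&= \frac{1}{4}\sum_{t=1}^{T} \{\exp[i(2\pi/T)(j+k)(t-1)] + \exp[i(2\pi/T)(k-j)(t-1)] \\
&\qquad\qquad + \exp[i(2\pi/T)(j-k)(t-1)] + \exp[i(2\pi/T)(-j-k)(t-1)]\}.
\end{aligned} \tag{6.A.16}$$

For any $j = 1, 2, \ldots, M$ and any $k = 1, 2, \ldots, M$ where $k \neq j$, expression [6.A.16] is zero by virtue of [6.A.4]. For $k = j$, the first and last sums in the last line of [6.A.16] are zero, so that the total is equal to

$$(1/4)\sum_{t=1}^{T} (1 + 1) = T/2.$$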




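Similarly, elements of $\sum_{t=1}^{T} \mathbf{x}_t\mathbf{x}_t'$ corresponding to cross products of the sine terms can be found from

$$\begin{aligned}
\sum_{t=1}^{T} &\sin[\omega_j(t-1)]\cdot\sin[\omega_k(t-1)] \\
&= -\frac{1}{4}\sum_{t=1}^{T} \{\exp[i\omega_j(t-1)] - \exp[-i\omega_j(t-1)]\} \times \{\exp[i\omega_k(t-1)] - \exp[-i\omega_k(t-1)]\} \\
&= -\frac{1}{4}\sum_{t=1}^{T} \{\exp[i(2\pi/T)(j+k)(t-1)] - \exp[i(2\pi/T)(k-j)(t-1)] \\
&\qquad\qquad - \exp[i(2\pi/T)(j-k)(t-1)] + \exp[i(2\pi/T)(-j-k)(t-1)]\} \\
&= \begin{cases} T/2 & \text{for } j = k \\ 0 & \text{otherwise.} \end{cases}
\end{aligned}$$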


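Finally, elements of $\sum_{t=1}^{T} \mathbf{x}_t\mathbf{x}_t'$ corresponding to cross products of the sine and cosine terms are given by

$$\begin{aligned}
\sum_{t=1}^{T} &\cos[\omega_j(t-1)]\cdot\sin[\omega_k(t-1)] \\
&= \frac{1}{4i}\sum_{t=1}^{T} \{\exp[i\omega_j(t-1)] + \exp[-i\omega_j(t-1)]\} \times \{\exp[i\omega_k(t-1)] - \exp[-i\omega_k(t-1)]\} \\
&= \frac{1}{4i}\sum_{t=1}^{T} \{\exp[i(2\pi/T)(j+k)(t-1)] + \exp[i(2\pi/T)(k-j)(t-1)] \\
&\qquad\qquad - \exp[i(2\pi/T)(j-k)(t-1)] - \exp[i(2\pi/T)(-j-k)(t-1)]\},
\end{aligned}$$

which equals zero for all $j$ and $k$. This completes the derivation of [6.2.11]. ■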
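The orthogonality pattern derived above can be checked numerically. The sketch below is illustrative only. It assumes that the regressor vector in [6.2.9] has the form (1, cos[ω_1(t-1)], sin[ω_1(t-1)], ..., cos[ω_M(t-1)], sin[ω_M(t-1)])' with ω_j = 2πj/T, T odd, and M = (T-1)/2, which is what the first column in [6.A.10] suggests. The names `regressor` and `cross_product_matrix` are hypothetical.

```python
import math

def regressor(T, t):
    """Assumed form of x_t: a constant followed by cosine/sine pairs."""
    M = (T - 1) // 2
    x = [1.0]
    for j in range(1, M + 1):
        w = 2 * math.pi * j / T          # omega_j = 2*pi*j/T
        x += [math.cos(w * (t - 1)), math.sin(w * (t - 1))]
    return x

def cross_product_matrix(T):
    """Sum over t = 1..T of x_t x_t' as a nested list."""
    rows = [regressor(T, t) for t in range(1, T + 1)]
    n = len(rows[0])
    return [[sum(r[a] * r[b] for r in rows) for b in range(n)]
            for a in range(n)]

# Hypothetical odd sample size; the result should be diag(T, T/2, ..., T/2).
S = cross_product_matrix(7)
for row in S:
    print([round(v, 10) for v in row])
```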


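■ Proof of Proposition 6.2(a). Let $\mathbf{b}$ denote the estimate of $\boldsymbol{\beta}$ based on OLS estimation of the regression in [6.2.8]:

$$\begin{aligned}
\mathbf{b} &= \left\{\sum_{t=1}^{T} \mathbf{x}_t\mathbf{x}_t'\right\}^{-1} \left\{\sum_{t=1}^{T} \mathbf{x}_t y_t\right\} \\
&= \begin{bmatrix} T & \mathbf{0}' \\ \mathbf{0} & (T/2)\cdot\mathbf{I}_{T-1} \end{bmatrix}^{-1} \left\{\sum_{t=1}^{T} \mathbf{x}_t y_t\right\} \\
&= \begin{bmatrix} T^{-1} & \mathbf{0}' \\ \mathbf{0} & (2/T)\cdot\mathbf{I}_{T-1} \end{bmatrix} \left\{\sum_{t=1}^{T} \mathbf{x}_t y_t\right\}.
\end{aligned} \tag{6.A.17}$$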


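But the definition of $\mathbf{x}_t$ in [6.2.9] implies that

$$\sum_{t=1}^{T} \mathbf{x}_t y_t = \Big[\, \textstyle\sum y_t \;\; \sum y_t\cdot\cos[\omega_1(t-1)] \;\; \sum y_t\cdot\sin[\omega_1(t-1)] \;\; \sum y_t\cdot\cos[\omega_2(t-1)] \;\; \sum y_t\cdot\sin[\omega_2(t-1)] \;\; \cdots \;\; \sum y_t\cdot\cos[\omega_M(t-1)] \;\; \sum y_t\cdot\sin[\omega_M(t-1)] \,\Big]', \tag{6.A.18}$$

where $\sum$ again denotes summation over $t$ from 1 to $T$. Substituting [6.A.18] into [6.A.17] produces result (a) of Proposition 6.2. ■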


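■ Proof of Proposition 6.2(b). Recall from expression [4.A.6] that the residual sum of squares associated with OLS estimation of [6.2.8] is

$$\sum_{t=1}^{T} \hat{u}_t^2 = \sum_{t=1}^{T} y_t^2 - \left[\sum_{t=1}^{T} y_t\mathbf{x}_t'\right] \left[\sum_{t=1}^{T} \mathbf{x}_t\mathbf{x}_t'\right]^{-1} \left[\sum_{t=1}^{T} \mathbf{x}_t y_t\right]. \tag{6.A.19}$$

Since there are as many explanatory variables as observations and since the explanatory variables are linearly independent, the OLS residuals $\hat{u}_t$ are all zero. Hence, [6.A.19] implies that

$$\sum_{t=1}^{T} y_t^2 = \left[\sum_{t=1}^{T} y_t\mathbf{x}_t'\right] \left[\sum_{t=1}^{T} \mathbf{x}_t\mathbf{x}_t'\right]^{-1} \left[\sum_{t=1}^{T} \mathbf{x}_t y_t\right]. \tag{6.A.20}$$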
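But [6.A.17] allows us to write

$$\sum_{t=1}^{T} \mathbf{x}_t y_t = \begin{bmatrix} T & \mathbf{0}' \\ \mathbf{0} & (T/2)\cdot\mathbf{I}_{T-1} \end{bmatrix} \mathbf{b}. \tag{6.A.21}$$

Substituting [6.A.21] and [6.2.11] into [6.A.20] establishes that

$$\begin{aligned}
\sum_{t=1}^{T} y_t^2 &= \mathbf{b}' \begin{bmatrix} T & \mathbf{0}' \\ \mathbf{0} & (T/2)\cdot\mathbf{I}_{T-1} \end{bmatrix} \begin{bmatrix} T & \mathbf{0}' \\ \mathbf{0} & (T/2)\cdot\mathbf{I}_{T-1} \end{bmatrix}^{-1} \begin{bmatrix} T & \mathbf{0}' \\ \mathbf{0} & (T/2)\cdot\mathbf{I}_{T-1} \end{bmatrix} \mathbf{b} \\
&= \mathbf{b}' \begin{bmatrix} T & \mathbf{0}' \\ \mathbf{0} & (T/2)\cdot\mathbf{I}_{T-1} \end{bmatrix} \mathbf{b} \\
&= T\cdot\hat{\mu}^2 + (T/2) \sum_{j=1}^{M} (\hat{\alpha}_j^2 + \hat{\delta}_j^2),
\end{aligned}$$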
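so that

$$(1/T) \sum_{t=1}^{T} y_t^2 = \hat{\mu}^2 + (1/2) \sum_{j=1}^{M} (\hat{\alpha}_j^2 + \hat{\delta}_j^2). \tag{6.A.22}$$

Finally, observe from [4.A.5] and the fact that $\hat{\mu} = \bar{y}$ that

$$(1/T) \sum_{t=1}^{T} y_t^2 - \hat{\mu}^2 = (1/T) \sum_{t=1}^{T} (y_t - \bar{y})^2,$$

allowing [6.A.22] to be written as

$$(1/T) \sum_{t=1}^{T} (y_t - \bar{y})^2 = (1/2) \sum_{j=1}^{M} (\hat{\alpha}_j^2 + \hat{\delta}_j^2),$$

as claimed in [6.2.14]. Since the regressors are all orthogonal, the term $\tfrac{1}{2}(\hat{\alpha}_j^2 + \hat{\delta}_j^2)$ can be interpreted as the portion of the sample variance that can be attributed to the regressors $\cos[\omega_j(t-1)]$ and $\sin[\omega_j(t-1)]$. ■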
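The decomposition in [6.2.14] can be illustrated numerically. The following is a sketch under stated assumptions rather than the book's own procedure: it takes the coefficient formulas from result (a), namely (2/T) times the sum of y_t·cos[ω_j(t-1)] and of y_t·sin[ω_j(t-1)], assumes T is odd with M = (T-1)/2, and uses made-up data. The function names are hypothetical.

```python
import math

def harmonic_coefficients(y):
    """Return [(alpha_j, delta_j)] for j = 1..M using the formulas of result (a)."""
    T = len(y)
    M = (T - 1) // 2
    coeffs = []
    for j in range(1, M + 1):
        w = 2 * math.pi * j / T
        alpha = (2 / T) * sum(y[t - 1] * math.cos(w * (t - 1)) for t in range(1, T + 1))
        delta = (2 / T) * sum(y[t - 1] * math.sin(w * (t - 1)) for t in range(1, T + 1))
        coeffs.append((alpha, delta))
    return coeffs

def variance_decomposition(y):
    """Return (sample variance, one half of the summed squared coefficients)."""
    T = len(y)
    ybar = sum(y) / T
    sample_var = sum((v - ybar) ** 2 for v in y) / T
    harmonic_sum = 0.5 * sum(a * a + d * d for a, d in harmonic_coefficients(y))
    return sample_var, harmonic_sum

# Made-up observations with an odd sample size (T = 7).
y = [2.0, -1.0, 0.5, 3.0, 4.5, -2.0, 1.0]
print(variance_decomposition(y))  # the two numbers should agree up to rounding
```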


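■ Proof of Proposition 6.2(c). Notice that

$$(\hat{\alpha}_j^2 + \hat{\delta}_j^2) = (\hat{\alpha}_j + i\cdot\hat{\delta}_j)(\hat{\alpha}_j - i\cdot\hat{\delta}_j). \tag{6.A.23}$$

But from result (a) of Proposition 6.2,

$$\hat{\alpha}_j = (2/T) \sum_{t=1}^{T} y_t\cdot\cos[\omega_j(t-1)] = (2/T) \sum_{t=1}^{T} (y_t - \bar{y})\cdot\cos[\omega_j(t-1)], \tag{6.A.24}$$

where the second equality follows from [6.A.11]. Similarly,

$$\hat{\delta}_j = (2/T) \sum_{t=1}^{T} (y_t - \bar{y})\cdot\sin[\omega_j(t-1)]. \tag{6.A.25}$$

It follows from [6.A.24] and [6.A.25] that

$$\begin{aligned}
\hat{\alpha}_j + i\cdot\hat{\delta}_j &= (2/T)\left\{\sum_{t=1}^{T} (y_t - \bar{y})\cdot\cos[\omega_j(t-1)] + i\cdot\sum_{t=1}^{T} (y_t - \bar{y})\cdot\sin[\omega_j(t-1)]\right\} \\
&= (2/T) \sum_{t=1}^{T} (y_t - \bar{y})\cdot\exp[i\omega_j(t-1)].
\end{aligned} \tag{6.A.26}$$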




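Similarly,

$$\hat{\alpha}_j - i\cdot\hat{\delta}_j = (2/T) \sum_{\tau=1}^{T} (y_\tau - \bar{y})\cdot\exp[-i\omega_j(\tau-1)]. \tag{6.A.27}$$

Substituting [6.A.26] and [6.A.27] into [6.A.23] produces

$$\begin{aligned}
\hat{\alpha}_j^2 + \hat{\delta}_j^2 &= (4/T^2)\left\{\sum_{t=1}^{T} (y_t - \bar{y})\cdot\exp[i\omega_j(t-1)]\right\} \times \left\{\sum_{\tau=1}^{T} (y_\tau - \bar{y})\cdot\exp[-i\omega_j(\tau-1)]\right\} \\
&= (4/T^2) \sum_{t=1}^{T}\sum_{\tau=1}^{T} (y_t - \bar{y})(y_\tau - \bar{y})\cdot\exp[i\omega_j(t-\tau)] \\
&= (4/T^2)\Big\{\sum_{t=1}^{T} (y_t - \bar{y})^2 + \sum_{t=1}^{T-1} (y_t - \bar{y})(y_{t+1} - \bar{y})\cdot\exp[-i\omega_j] \\
&\qquad + \sum_{t=2}^{T} (y_t - \bar{y})(y_{t-1} - \bar{y})\cdot\exp[i\omega_j] \\
&\qquad + \sum_{t=1}^{T-2} (y_t - \bar{y})(y_{t+2} - \bar{y})\cdot\exp[-2i\omega_j] \\
&\qquad + \sum_{t=3}^{T} (y_t - \bar{y})(y_{t-2} - \bar{y})\cdot\exp[2i\omega_j] + \cdots \\
&\qquad + (y_1 - \bar{y})(y_T - \bar{y})\cdot\exp[-(T-1)i\omega_j] \\
&\qquad + (y_T - \bar{y})(y_1 - \bar{y})\cdot\exp[(T-1)i\omega_j]\Big\} \\
&= (4/T)\Big\{\hat{\gamma}_0 + \hat{\gamma}_1\cdot\exp[-i\omega_j] + \hat{\gamma}_{-1}\cdot\exp[i\omega_j] \\
&\qquad + \hat{\gamma}_2\cdot\exp[-2i\omega_j] + \hat{\gamma}_{-2}\cdot\exp[2i\omega_j] + \cdots \\
&\qquad + \hat{\gamma}_{T-1}\cdot\exp[-(T-1)i\omega_j] + \hat{\gamma}_{-T+1}\cdot\exp[(T-1)i\omega_j]\Big\} \\
&= (4/T)(2\pi)\,\hat{s}_y(\omega_j),
\end{aligned} \tag{6.A.28}$$

from which equation [6.2.15] follows. ■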


Chapter 6 Exercises 


6.1. Derive [6.1.12] directly from expression [6.1.6] and the formulas for the autocovariances 
of an MA(1) process. 


6.2. Integrate [6.1.9] and [6.1.12] to confirm independently that [6.1.17] holds for white 
noise and an MA(1) process. 


Chapter 6 References 


Anderson, T. W. 1971. The Statistical Analysis of Time Series. New York: Wiley. 
Bloomfield, Peter. 1976. Fourier Analysis of Time Series: An Introduction. New York: Wiley. 




Cramér, Harald, and M. R. Leadbetter. 1967. Stationary and Related Stochastic Processes. 
New York: Wiley. 


Fuller, Wayne A. 1976. Introduction to Statistical Time Series. New York: Wiley. 


Thomas, George B., Jr. 1972. Calculus and Analytic Geometry, alternate ed. Reading, Mass.: 
Addison-Wesley. 




7

Asymptotic Distribution Theory


Suppose a sample of $T$ observations $(y_1, y_2, \ldots, y_T)$ has been used to construct $\hat{\boldsymbol{\theta}}$, an estimate of the vector of population parameters. For example, the parameter vector $\boldsymbol{\theta} = (c, \phi_1, \phi_2, \ldots, \phi_p, \sigma^2)'$ for an AR($p$) process might have been estimated from an OLS regression of $y_t$ on lagged $y$'s. We would like to know how far this estimate $\hat{\boldsymbol{\theta}}$ is likely to be from the true value $\boldsymbol{\theta}$ and how to test a hypothesis about the true value based on the observed sample of $y$'s.

Much of the distribution theory used to answer these questions is asymptotic: that is, it describes the properties of estimators as the sample size ($T$) goes to infinity. This chapter develops the basic asymptotic results that will be used in subsequent chapters. The first section summarizes the key tools of asymptotic analysis and presents limit theorems for the sample mean of a sequence of i.i.d. random variables. Section 7.2 develops limit theorems for serially dependent variables with time-varying marginal distributions.


7.1. Review of Asymptotic Distribution Theory 


Limits of Deterministic Sequences 


Let $\{c_T\}_{T=1}^{\infty}$ denote a sequence of deterministic numbers. The sequence is said to converge to $c$ if for any $\varepsilon > 0$, there exists an $N$ such that $|c_T - c| < \varepsilon$ whenever $T \geq N$; in other words, $c_T$ will be as close as desired to $c$ so long as $T$ is sufficiently large. This is indicated as

$$\lim_{T\to\infty} c_T = c, \tag{7.1.1}$$

or, equivalently,

$$c_T \to c.$$

For example, $c_T = 1/T$ denotes the sequence $\{1, \tfrac{1}{2}, \tfrac{1}{3}, \ldots\}$, for which

$$\lim_{T\to\infty} c_T = 0.$$

A sequence of deterministic $(m \times n)$ matrices $\{\mathbf{C}_T\}_{T=1}^{\infty}$ converges to $\mathbf{C}$ if each element of $\mathbf{C}_T$ converges to the corresponding element of $\mathbf{C}$.




Convergence in Probability 


Consider a sequence of scalar random variables, $\{X_T\}_{T=1}^{\infty}$. The sequence is said to converge in probability to $c$ if for every $\varepsilon > 0$ and every $\delta > 0$ there exists a value $N$ such that, for all $T \geq N$,

$$P\{|X_T - c| > \delta\} < \varepsilon. \tag{7.1.2}$$

In words, if we go far enough along in the sequence, the probability that $X_T$ differs from $c$ by more than $\delta$ can be made arbitrarily small for any $\delta$.

When [7.1.2] is satisfied, the number $c$ is called the probability limit, or plim, of the sequence $\{X_T\}$. This is indicated as

$$\operatorname{plim} X_T = c,$$

or, equivalently,

$$X_T \xrightarrow{p} c.$$

Recall that if $\{c_T\}_{T=1}^{\infty}$ is a deterministic sequence converging to $c$, then there exists an $N$ such that $|c_T - c| < \delta$ for all $T \geq N$. Then $P\{|c_T - c| > \delta\} = 0$ for all $T \geq N$. Thus, if a deterministic sequence converges to $c$, then we could also say that $c_T \xrightarrow{p} c$.

A sequence of $(m \times n)$ matrices of random variables $\{\mathbf{X}_T\}$ converges in probability to the $(m \times n)$ matrix $\mathbf{C}$ if each element of $\mathbf{X}_T$ converges in probability to the corresponding element of $\mathbf{C}$.

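More generally, if $\{\mathbf{X}_T\}$ and $\{\mathbf{Y}_T\}$ are sequences of $(m \times n)$ matrices, we will use the notation

$$\mathbf{X}_T \xrightarrow{p} \mathbf{Y}_T$$

to indicate that the difference between the two sequences converges in probability to zero:

$$\mathbf{X}_T - \mathbf{Y}_T \xrightarrow{p} \mathbf{0}.$$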


An example of a sequence of random variables of interest is the following. Suppose we have a sample of $T$ observations on a random variable $\{Y_1, Y_2, \ldots, Y_T\}$. Consider the sample mean,

$$\bar{Y}_T \equiv (1/T) \sum_{t=1}^{T} Y_t, \tag{7.1.3}$$

as an estimator of the population mean,

$$\hat{\mu}_T = \bar{Y}_T.$$

We append the subscript $T$ to this estimator to emphasize that it describes the mean of a sample of size $T$. The primary focus will be on the behavior of this estimator as $T$ grows large. Thus, we will be interested in the properties of the sequence $\{\hat{\mu}_T\}_{T=1}^{\infty}$.

When the plim of a sequence of estimators (such as $\{\hat{\mu}_T\}_{T=1}^{\infty}$) is equal to the true population parameter (in this case, $\mu$), the estimator is said to be consistent. If an estimator is consistent, then there exists a sufficiently large sample such that we can be assured with very high probability that the estimate will be within any desired tolerance band around the true value.




The following result is quite helpful in finding plims; a proof of this and some of the other propositions of this chapter is provided in Appendix 7.A at the end of the chapter.

Proposition 7.1: Let $\{\mathbf{X}_T\}$ denote a sequence of $(n \times 1)$ random vectors with plim $\mathbf{c}$, and let $\mathbf{g}(\mathbf{c})$ be a vector-valued function, $\mathbf{g}\colon \mathbb{R}^n \to \mathbb{R}^m$, where $\mathbf{g}(\cdot)$ is continuous at $\mathbf{c}$ and does not depend on $T$. Then $\mathbf{g}(\mathbf{X}_T) \xrightarrow{p} \mathbf{g}(\mathbf{c})$.

The basic idea behind this proposition is that, since $\mathbf{g}(\cdot)$ is continuous, $\mathbf{g}(\mathbf{X}_T)$ will be close to $\mathbf{g}(\mathbf{c})$ provided that $\mathbf{X}_T$ is close to $\mathbf{c}$. By choosing a sufficiently large value of $T$, the probability that $\mathbf{X}_T$ is close to $\mathbf{c}$ (and thus that $\mathbf{g}(\mathbf{X}_T)$ is close to $\mathbf{g}(\mathbf{c})$) can be brought as near to unity as desired.

Note that $\mathbf{g}(\mathbf{X}_T)$ depends on the value of $\mathbf{X}_T$ but cannot depend on the index $T$ itself. Thus, $g(X_T, T) = T\cdot X_T^2$ is not a function covered by Proposition 7.1.


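Example 7.1

If $X_{1T} \xrightarrow{p} c_1$ and $X_{2T} \xrightarrow{p} c_2$, then $(X_{1T} + X_{2T}) \xrightarrow{p} (c_1 + c_2)$. This follows immediately, since $g(X_{1T}, X_{2T}) = (X_{1T} + X_{2T})$ is a continuous function of $(X_{1T}, X_{2T})$.

Example 7.2

Let $\mathbf{X}_{1T}$ denote a sequence of $(n \times n)$ random matrices with $\mathbf{X}_{1T} \xrightarrow{p} \mathbf{C}_1$, a nonsingular matrix. Let $\mathbf{X}_{2T}$ denote a sequence of $(n \times 1)$ random vectors with $\mathbf{X}_{2T} \xrightarrow{p} \mathbf{c}_2$. Then $[\mathbf{X}_{1T}]^{-1}\mathbf{X}_{2T} \xrightarrow{p} [\mathbf{C}_1]^{-1}\mathbf{c}_2$. To see this, note that the elements of the matrix $[\mathbf{X}_{1T}]^{-1}$ are continuous functions of the elements of $\mathbf{X}_{1T}$ at $\mathbf{X}_{1T} = \mathbf{C}_1$, since $[\mathbf{C}_1]^{-1}$ exists. Thus, $[\mathbf{X}_{1T}]^{-1} \xrightarrow{p} [\mathbf{C}_1]^{-1}$. Similarly, the elements of $[\mathbf{X}_{1T}]^{-1}\mathbf{X}_{2T}$ are sums of products of elements of $[\mathbf{X}_{1T}]^{-1}$ with those of $\mathbf{X}_{2T}$. Since each sum is again a continuous function of $\mathbf{X}_{1T}$ and $\mathbf{X}_{2T}$,

$$\operatorname{plim}\, [\mathbf{X}_{1T}]^{-1}\mathbf{X}_{2T} = [\operatorname{plim} \mathbf{X}_{1T}]^{-1} \operatorname{plim} \mathbf{X}_{2T} = [\mathbf{C}_1]^{-1}\mathbf{c}_2.$$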


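Proposition 7.1 also holds if some of the elements of $\mathbf{X}_T$ are deterministic with conventional limits as in expression [7.1.1]. Specifically, let $\mathbf{X}_T' = (\mathbf{X}_{1T}', \mathbf{c}_{2T}')$, where $\mathbf{X}_{1T}$ is a stochastic $(n_1 \times 1)$ vector and $\mathbf{c}_{2T}$ is a deterministic $(n_2 \times 1)$ vector. If $\operatorname{plim} \mathbf{X}_{1T} = \mathbf{c}_1$ and $\lim_{T\to\infty} \mathbf{c}_{2T} = \mathbf{c}_2$, then $\mathbf{g}(\mathbf{X}_{1T}, \mathbf{c}_{2T}) \xrightarrow{p} \mathbf{g}(\mathbf{c}_1, \mathbf{c}_2)$. (See Exercise 7.1.)

Example 7.3

Consider an alternative estimator of the mean given by $\bar{Y}_T^{*} \equiv [1/(T-1)] \times \sum_{t=1}^{T} Y_t$. This can be written as $c_{1T}\bar{Y}_T$, where $c_{1T} \equiv [T/(T-1)]$ and $\bar{Y}_T \equiv (1/T)\sum_{t=1}^{T} Y_t$. Under general conditions detailed in Section 7.2, the sample mean is a consistent estimator of the population mean, implying that $\bar{Y}_T \xrightarrow{p} \mu$. It is also easy to verify that $c_{1T} \to 1$. Since $c_{1T}\bar{Y}_T$ is a continuous function of $c_{1T}$ and $\bar{Y}_T$, it follows that $c_{1T}\bar{Y}_T \xrightarrow{p} 1\cdot\mu = \mu$. Thus, $\bar{Y}_T^{*}$, like $\bar{Y}_T$, is a consistent estimator of $\mu$.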


Convergence in Mean Square and Chebyshev’s Inequality 


A stronger condition than convergence in probability is mean square convergence. The random sequence $\{X_T\}$ is said to converge in mean square to $c$, indicated as

$$X_T \xrightarrow{m.s.} c,$$

if for every $\varepsilon > 0$ there exists a value $N$ such that, for all $T \geq N$,

$$E(X_T - c)^2 < \varepsilon. \tag{7.1.4}$$

Another useful result is the following.


Proposition 7.2: (Generalized Chebyshev's inequality). Let $X$ be a random variable with $E(|X|^r)$ finite for some $r > 0$. Then, for any $\delta > 0$ and any value of $c$,

$$P\{|X - c| > \delta\} \leq \frac{E|X - c|^r}{\delta^r}. \tag{7.1.5}$$

An implication of Chebyshev's inequality is that if $X_T \xrightarrow{m.s.} c$, then $X_T \xrightarrow{p} c$. To see this, note that if $X_T \xrightarrow{m.s.} c$, then for any $\varepsilon > 0$ and $\delta > 0$ there exists an $N$ such that $E(X_T - c)^2 < \delta^2\varepsilon$ for all $T \geq N$. This would ensure that

$$\frac{E(X_T - c)^2}{\delta^2} < \varepsilon$$

for all $T \geq N$. From Chebyshev's inequality, this also implies

$$P\{|X_T - c| > \delta\} < \varepsilon$$

for all $T \geq N$, or that $X_T \xrightarrow{p} c$.
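Inequality [7.1.5] can be verified exactly for a simple discrete distribution. The sketch below is illustrative only; the fair six-sided die, the values c = 7/2, δ = 2, r = 2, and the function name `chebyshev_sides` are assumptions introduced for the example.

```python
from fractions import Fraction

def chebyshev_sides(values, probs, c, delta, r):
    """Return (P{|X - c| > delta}, E|X - c|^r / delta^r) for a discrete X."""
    tail = sum(p for x, p in zip(values, probs) if abs(x - c) > delta)
    bound = sum(p * abs(x - c) ** r for x, p in zip(values, probs)) / delta ** r
    return tail, bound

# Hypothetical example: a fair die centered at its mean 7/2.
die = [Fraction(k) for k in range(1, 7)]
probs = [Fraction(1, 6)] * 6
tail, bound = chebyshev_sides(die, probs, Fraction(7, 2), Fraction(2), 2)
print(tail, bound)  # the left side should not exceed the right side
```

Exact rational arithmetic is used so that the comparison is not affected by rounding.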


Law of Large Numbers for Independent 
and Identically Distributed Variables 


Let us now consider the behavior of the sample mean $\bar{Y}_T = (1/T)\sum_{t=1}^{T} Y_t$ where $\{Y_t\}$ is i.i.d. with mean $\mu$ and variance $\sigma^2$. For this case, $\bar{Y}_T$ has expectation $\mu$ and variance

$$E(\bar{Y}_T - \mu)^2 = (1/T^2)\operatorname{Var}\left(\sum_{t=1}^{T} Y_t\right) = (1/T^2) \sum_{t=1}^{T} \operatorname{Var}(Y_t) = \sigma^2/T.$$

Since $\sigma^2/T \to 0$ as $T \to \infty$, this means that $\bar{Y}_T \xrightarrow{m.s.} \mu$, implying also that $\bar{Y}_T \xrightarrow{p} \mu$.

Figure 7.1 graphs an example of the density of the sample mean $f_{\bar{Y}_T}(\bar{y}_T)$ for three different values of $T$. As $T$ becomes large, the density becomes increasingly concentrated in a spike centered at $\mu$.

The result that the sample mean is a consistent estimate of the population mean is known as the law of large numbers.¹ It was proved here for the special case of i.i.d. variables with finite variance. In fact, it turns out also to be true of any sequence of i.i.d. variables with finite mean $\mu$.² Section 7.2 explores some of the circumstances under which it also holds for serially dependent variables with time-varying marginal distributions.


Convergence in Distribution 


Let $\{X_T\}_{T=1}^{\infty}$ be a sequence of random variables, and let $F_{X_T}(x)$ denote the cumulative distribution function of $X_T$. Suppose that there exists a cumulative distribution function $F_X(x)$ such that

$$\lim_{T\to\infty} F_{X_T}(x) = F_X(x)$$

at any value $x$ at which $F_X(\cdot)$ is continuous. Then $X_T$ is said to converge in distribution (or in law) to $X$, denoted

$$X_T \xrightarrow{L} X.$$

When $F_X(x)$ is of a common form, such as the cumulative distribution function for a $N(\mu, \sigma^2)$ variable, we will equivalently write

$$X_T \xrightarrow{L} N(\mu, \sigma^2).$$

¹This is often described as the weak law of large numbers. An analogous result known as the strong law of large numbers refers to almost sure convergence rather than convergence in probability of the sample mean.

²This is known as Khinchine's theorem. See, for example, Rao (1973, p. 112).

FIGURE 7.1 Density of the sample mean for a sample of size T.
The definitions are unchanged if the scalar $X_T$ is replaced with an $(n \times 1)$ vector $\mathbf{X}_T$. A simple way to verify convergence in distribution of a vector is the following.³ If the scalar $(\lambda_1 X_{1T} + \lambda_2 X_{2T} + \cdots + \lambda_n X_{nT})$ converges in distribution to $(\lambda_1 X_1 + \lambda_2 X_2 + \cdots + \lambda_n X_n)$ for any real values of $(\lambda_1, \lambda_2, \ldots, \lambda_n)$, then the vector $\mathbf{X}_T \equiv (X_{1T}, X_{2T}, \ldots, X_{nT})'$ converges in distribution to the vector $\mathbf{X} \equiv (X_1, X_2, \ldots, X_n)'$.

The following results are useful in determining limiting distributions.⁴


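Proposition 7.3:

(a) Let $\{\mathbf{Y}_T\}$ be a sequence of $(n \times 1)$ random vectors with $\mathbf{Y}_T \xrightarrow{L} \mathbf{Y}$. Suppose that $\{\mathbf{X}_T\}$ is a sequence of $(n \times 1)$ random vectors such that $(\mathbf{X}_T - \mathbf{Y}_T) \xrightarrow{p} \mathbf{0}$. Then $\mathbf{X}_T \xrightarrow{L} \mathbf{Y}$; that is, $\mathbf{X}_T$ and $\mathbf{Y}_T$ have the same limiting distribution.

(b) Let $\{\mathbf{X}_T\}$ be a sequence of random $(n \times 1)$ vectors with $\mathbf{X}_T \xrightarrow{p} \mathbf{c}$, and let $\{\mathbf{Y}_T\}$ be a sequence of random $(n \times 1)$ vectors with $\mathbf{Y}_T \xrightarrow{L} \mathbf{Y}$. Then the sequence constructed from the sum $\{\mathbf{X}_T + \mathbf{Y}_T\}$ converges in distribution to $\mathbf{c} + \mathbf{Y}$ and the sequence constructed from the product $\{\mathbf{X}_T'\mathbf{Y}_T\}$ converges in distribution to $\mathbf{c}'\mathbf{Y}$.

(c) Let $\{\mathbf{X}_T\}$ be a sequence of random $(n \times 1)$ vectors with $\mathbf{X}_T \xrightarrow{L} \mathbf{X}$, and let $\mathbf{g}(\mathbf{X})$, $\mathbf{g}\colon \mathbb{R}^n \to \mathbb{R}^m$, be a continuous function (not dependent on $T$). Then the sequence of random variables $\{\mathbf{g}(\mathbf{X}_T)\}$ converges in distribution to $\mathbf{g}(\mathbf{X})$.

³This is known as the Cramér-Wold theorem. See Rao (1973, p. 123).

⁴See Rao (1973, pp. 122-24).

FIGURE 7.2 Density of $\sqrt{T}(\bar{Y}_T - \mu)$.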


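Example 7.4

Suppose that $X_T \xrightarrow{p} c$ and $Y_T \xrightarrow{L} Y$, where $Y \sim N(\mu, \sigma^2)$. Then, by Proposition 7.3(b), the sequence $X_T Y_T$ has the same limiting probability law as that of $c$ times a $N(\mu, \sigma^2)$ variable. In other words, $X_T Y_T \xrightarrow{L} N(c\mu, c^2\sigma^2)$.

Example 7.5

Generalizing the previous result, let $\{\mathbf{X}_T\}$ be a sequence of random $(m \times n)$ matrices and $\{\mathbf{Y}_T\}$ a sequence of random $(n \times 1)$ vectors with $\mathbf{X}_T \xrightarrow{p} \mathbf{C}$ and $\mathbf{Y}_T \xrightarrow{L} \mathbf{Y}$, with $\mathbf{Y} \sim N(\boldsymbol{\mu}, \boldsymbol{\Omega})$. Then the limiting distribution of $\mathbf{X}_T\mathbf{Y}_T$ is the same as that of $\mathbf{C}\mathbf{Y}$; that is, $\mathbf{X}_T\mathbf{Y}_T \xrightarrow{L} N(\mathbf{C}\boldsymbol{\mu}, \mathbf{C}\boldsymbol{\Omega}\mathbf{C}')$.

Example 7.6

Suppose that $X_T \xrightarrow{L} N(0, 1)$. Then Proposition 7.3(c) implies that the square of $X_T$ asymptotically behaves as the square of a $N(0, 1)$ variable: $X_T^2 \xrightarrow{L} \chi^2(1)$.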


Central Limit Theorem 


We have seen that the sample mean $\bar{Y}_T$ for an i.i.d. sequence has a degenerate probability density as $T \to \infty$, collapsing toward a point mass at $\mu$ as the sample size grows. For statistical inference we would like to describe the distribution of $\bar{Y}_T$ in more detail. For this purpose, note that the random variable $\sqrt{T}(\bar{Y}_T - \mu)$ has mean zero and variance given by $(\sqrt{T})^2\operatorname{Var}(\bar{Y}_T) = \sigma^2$ for all $T$, and thus, in contrast to $\bar{Y}_T$, the random variable $\sqrt{T}(\bar{Y}_T - \mu)$ might be expected to converge to a nondegenerate random variable as $T$ goes to infinity.

The central limit theorem is the result that, as $T$ increases, the sequence $\sqrt{T}(\bar{Y}_T - \mu)$ converges in distribution to a Gaussian random variable. The most familiar, albeit restrictive, version of the central limit theorem establishes that if $Y_t$ is i.i.d. with mean $\mu$ and variance $\sigma^2$, then⁵

$$\sqrt{T}(\bar{Y}_T - \mu) \xrightarrow{L} N(0, \sigma^2). \tag{7.1.6}$$


Result [7.1.6] also holds under much more general conditions, some of which are explored in the next section.

Figure 7.2 graphs an example of the density of $\sqrt{T}(\bar{Y}_T - \mu)$ for three different values of $T$. Each of these densities has mean zero and variance $\sigma^2$. As $T$ becomes large, the density converges to that of a $N(0, \sigma^2)$ variable.

⁵See, for example, White (1984, pp. 108-9).
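Result [7.1.6] can be illustrated with a small seeded simulation. This is a rough sketch rather than a proof; the uniform(0, 1) population (mean 1/2, variance 1/12), the values T = 50 and 4000 replications, and the function name `scaled_mean_draws` are assumptions made for the example.

```python
import math
import random

def scaled_mean_draws(T, reps, seed=0):
    """Simulate sqrt(T) * (Ybar_T - mu) for i.i.d. uniform(0, 1) draws."""
    rng = random.Random(seed)
    mu = 0.5
    draws = []
    for _ in range(reps):
        ybar = sum(rng.random() for _ in range(T)) / T
        draws.append(math.sqrt(T) * (ybar - mu))
    return draws

sigma2 = 1 / 12  # variance of a uniform(0, 1) variable
draws = scaled_mean_draws(T=50, reps=4000)
# Second moment about zero, which should be near sigma2.
variance = sum(d * d for d in draws) / len(draws)
# Share of draws inside +/- 1.96 standard deviations, near 0.95 under normality.
coverage = sum(abs(d) <= 1.96 * math.sqrt(sigma2) for d in draws) / len(draws)
print(variance, coverage)
```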
A final useful result is the following. 


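Proposition 7.4: Let $\{\mathbf{X}_T\}$ be a sequence of random $(n \times 1)$ vectors such that $\sqrt{T}(\mathbf{X}_T - \mathbf{c}) \xrightarrow{L} \mathbf{X}$, and let $\mathbf{g}\colon \mathbb{R}^n \to \mathbb{R}^m$ have continuous first derivatives with $\mathbf{G}$ denoting the $(m \times n)$ matrix of derivatives evaluated at $\mathbf{c}$:

$$\mathbf{G} \equiv \left.\frac{\partial \mathbf{g}}{\partial \mathbf{x}'}\right|_{\mathbf{x} = \mathbf{c}}.$$

Then $\sqrt{T}[\mathbf{g}(\mathbf{X}_T) - \mathbf{g}(\mathbf{c})] \xrightarrow{L} \mathbf{G}\mathbf{X}$.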


Example 7.7

Let $\{Y_1, Y_2, \ldots, Y_T\}$ be an i.i.d. sample of size $T$ drawn from a distribution with mean $\mu \neq 0$ and variance $\sigma^2$. Consider the distribution of the reciprocal of the sample mean, $S_T = 1/\bar{Y}_T$, where $\bar{Y}_T \equiv (1/T)\sum_{t=1}^{T} Y_t$. We know from the central limit theorem that $\sqrt{T}(\bar{Y}_T - \mu) \xrightarrow{L} Y$, where $Y \sim N(0, \sigma^2)$. Also, $g(y) = 1/y$ is continuous at $y = \mu$. Let $G \equiv (\partial g/\partial y)|_{y=\mu} = (-1/\mu^2)$. Then $\sqrt{T}[S_T - (1/\mu)] \xrightarrow{L} G\cdot Y$; in other words, $\sqrt{T}[S_T - (1/\mu)] \xrightarrow{L} N(0, \sigma^2/\mu^4)$.
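The limiting variance in Example 7.7 can be compared with a seeded simulation. The following is only a sketch; the Gaussian population with μ = 2 and σ = 1, the sample size T = 200, the 1000 replications, and the function name `reciprocal_mean_draws` are all assumptions chosen for illustration.

```python
import math
import random

def reciprocal_mean_draws(mu, sigma, T, reps, seed=0):
    """Simulate sqrt(T) * (1/Ybar_T - 1/mu) for i.i.d. N(mu, sigma^2) samples."""
    rng = random.Random(seed)
    out = []
    for _ in range(reps):
        ybar = sum(rng.gauss(mu, sigma) for _ in range(T)) / T
        out.append(math.sqrt(T) * (1 / ybar - 1 / mu))
    return out

mu, sigma = 2.0, 1.0
draws = reciprocal_mean_draws(mu, sigma, T=200, reps=1000)
mean = sum(draws) / len(draws)
simulated_var = sum((d - mean) ** 2 for d in draws) / len(draws)
asymptotic_var = sigma ** 2 / mu ** 4  # sigma^2 / mu^4 from Example 7.7
print(simulated_var, asymptotic_var)
```

With these settings the simulated variance should lie close to σ²/μ⁴ = 0.0625, although the match is only approximate in any finite sample.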


7.2. Limit Theorems for Serially Dependent Observations 


The previous section stated the law of large numbers and central limit theorem for independent and identically distributed random variables with finite second moments. This section develops analogous results for heterogeneously distributed variables with various forms of serial dependence. We first develop a law of large numbers for a general covariance-stationary process.


Law of Large Numbers for a Covariance-Stationary Process 


Let $(Y_1, Y_2, \ldots, Y_T)$ represent a sample of size $T$ from a covariance-stationary process with

$$E(Y_t) = \mu \qquad \text{for all } t \tag{7.2.1}$$

$$E(Y_t - \mu)(Y_{t-j} - \mu) = \gamma_j \qquad \text{for all } t \tag{7.2.2}$$

$$\sum_{j=0}^{\infty} |\gamma_j| < \infty. \tag{7.2.3}$$

Consider the properties of the sample mean,

$$\bar{Y}_T = (1/T) \sum_{t=1}^{T} Y_t. \tag{7.2.4}$$

Taking expectations of [7.2.4] reveals that the sample mean provides an unbiased estimate of the population mean,

$$E(\bar{Y}_T) = \mu,$$




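while the variance of the sample mean is

$$\begin{aligned}
E(\bar{Y}_T - \mu)^2 &= E\left[(1/T)\sum_{t=1}^{T} (Y_t - \mu)\right]^2 \\
&= (1/T^2)\,E\{[(Y_1 - \mu) + (Y_2 - \mu) + \cdots + (Y_T - \mu)] \\
&\qquad\qquad \times [(Y_1 - \mu) + (Y_2 - \mu) + \cdots + (Y_T - \mu)]\} \\
&= (1/T^2)\,E\{(Y_1 - \mu)[(Y_1 - \mu) + (Y_2 - \mu) + \cdots + (Y_T - \mu)] \\
&\qquad + (Y_2 - \mu)[(Y_1 - \mu) + (Y_2 - \mu) + \cdots + (Y_T - \mu)] \\
&\qquad + (Y_3 - \mu)[(Y_1 - \mu) + (Y_2 - \mu) + \cdots + (Y_T - \mu)] \\
&\qquad + \cdots + (Y_T - \mu)[(Y_1 - \mu) + (Y_2 - \mu) + \cdots + (Y_T - \mu)]\} \\
&= (1/T^2)\{[\gamma_0 + \gamma_1 + \gamma_2 + \gamma_3 + \cdots + \gamma_{T-1}] \\
&\qquad + [\gamma_1 + \gamma_0 + \gamma_1 + \gamma_2 + \cdots + \gamma_{T-2}] \\
&\qquad + [\gamma_2 + \gamma_1 + \gamma_0 + \gamma_1 + \cdots + \gamma_{T-3}] \\
&\qquad + \cdots + [\gamma_{T-1} + \gamma_{T-2} + \gamma_{T-3} + \cdots + \gamma_0]\}.
\end{aligned}$$

Thus,

$$E(\bar{Y}_T - \mu)^2 = (1/T^2)\{T\gamma_0 + 2(T-1)\gamma_1 + 2(T-2)\gamma_2 + 2(T-3)\gamma_3 + \cdots + 2\gamma_{T-1}\}$$

or

$$\begin{aligned}
E(\bar{Y}_T - \mu)^2 = (1/T)\{&\gamma_0 + [(T-1)/T](2\gamma_1) + [(T-2)/T](2\gamma_2) \\
&+ [(T-3)/T](2\gamma_3) + \cdots + [1/T](2\gamma_{T-1})\}.
\end{aligned} \tag{7.2.5}$$

It is easy to see that this expression goes to zero as the sample size grows, that is, that $\bar{Y}_T \xrightarrow{m.s.} \mu$:

$$\begin{aligned}
T\cdot E(\bar{Y}_T - \mu)^2 &= \big|\gamma_0 + [(T-1)/T](2\gamma_1) + [(T-2)/T](2\gamma_2) \\
&\qquad + [(T-3)/T](2\gamma_3) + \cdots + [1/T](2\gamma_{T-1})\big| \\
&\leq \{|\gamma_0| + [(T-1)/T]\cdot 2|\gamma_1| + [(T-2)/T]\cdot 2|\gamma_2| \\
&\qquad + [(T-3)/T]\cdot 2|\gamma_3| + \cdots + [1/T]\cdot 2|\gamma_{T-1}|\} \\
&\leq \{|\gamma_0| + 2|\gamma_1| + 2|\gamma_2| + 2|\gamma_3| + \cdots\}.
\end{aligned} \tag{7.2.6}$$

Hence, $T\cdot E(\bar{Y}_T - \mu)^2 < \infty$, by [7.2.3], and so $E(\bar{Y}_T - \mu)^2 \to 0$, as claimed.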
It is also of interest to calculate the limiting value of $T\cdot E(\bar{Y}_T - \mu)^2$. Result [7.2.5] expresses this variance for finite $T$ as a weighted average of the first $T - 1$ autocovariances $\gamma_j$. For large $j$, these autocovariances approach zero and will not affect the sum. For small $j$, the autocovariances are given a weight that approaches unity as the sample size grows. Thus, we might guess that

$$\lim_{T\to\infty} T\cdot E(\bar{Y}_T - \mu)^2 = \sum_{j=-\infty}^{\infty} \gamma_j = \gamma_0 + 2\gamma_1 + 2\gamma_2 + 2\gamma_3 + \cdots. \tag{7.2.7}$$

This conjecture is indeed correct. To verify this, note that the assumption [7.2.3] means that for any $\varepsilon > 0$ there exists a $q$ such that

$$2|\gamma_{q+1}| + 2|\gamma_{q+2}| + 2|\gamma_{q+3}| + \cdots < \varepsilon/2.$$




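Now

$$\begin{aligned}
\Big|\sum_{j=-\infty}^{\infty} \gamma_j &- T\cdot E(\bar{Y}_T - \mu)^2\Big| \\
&= \big|\{\gamma_0 + 2\gamma_1 + 2\gamma_2 + 2\gamma_3 + \cdots\} \\
&\qquad - \{\gamma_0 + [(T-1)/T]\cdot 2\gamma_1 + [(T-2)/T]\cdot 2\gamma_2 \\
&\qquad\qquad + [(T-3)/T]\cdot 2\gamma_3 + \cdots + [1/T]\cdot 2\gamma_{T-1}\}\big| \\
&\leq (1/T)\cdot 2|\gamma_1| + (2/T)\cdot 2|\gamma_2| + (3/T)\cdot 2|\gamma_3| + \cdots \\
&\qquad + (q/T)\cdot 2|\gamma_q| + 2|\gamma_{q+1}| + 2|\gamma_{q+2}| + 2|\gamma_{q+3}| + \cdots \\
&\leq (1/T)\cdot 2|\gamma_1| + (2/T)\cdot 2|\gamma_2| + (3/T)\cdot 2|\gamma_3| + \cdots \\
&\qquad + (q/T)\cdot 2|\gamma_q| + \varepsilon/2.
\end{aligned}$$

Moreover, for this given $q$, we can find an $N$ such that

$$(1/T)\cdot 2|\gamma_1| + (2/T)\cdot 2|\gamma_2| + (3/T)\cdot 2|\gamma_3| + \cdots + (q/T)\cdot 2|\gamma_q| < \varepsilon/2$$

for all $T \geq N$, ensuring that

$$\Big|\sum_{j=-\infty}^{\infty} \gamma_j - T\cdot E(\bar{Y}_T - \mu)^2\Big| < \varepsilon,$$

as was to be shown.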
These results can be summarized as follows. 


Proposition 7.5: Let $Y_t$ be a covariance-stationary process with moments given by [7.2.1] and [7.2.2] and with absolutely summable autocovariances as in [7.2.3]. Then the sample mean [7.2.4] satisfies

(a) $\bar{Y}_T \xrightarrow{m.s.} \mu$

(b) $\lim_{T\to\infty} \{T\cdot E(\bar{Y}_T - \mu)^2\} = \sum_{j=-\infty}^{\infty} \gamma_j$.
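Part (b) of the proposition can be illustrated with a process whose autocovariances are known in closed form. The sketch below is an assumption-laden example, not part of the original argument: it uses the standard AR(1) autocovariances σ²φ^|j|/(1 - φ²), for which the sum over all j is σ²/(1 - φ)², and evaluates the finite-T expression [7.2.5] directly. The function names and the values φ = 0.5, σ² = 1 are hypothetical.

```python
def ar1_autocovariance(j, phi, sigma2):
    """Autocovariance of a stationary AR(1): sigma2 * phi^|j| / (1 - phi^2)."""
    return sigma2 * phi ** abs(j) / (1 - phi ** 2)

def scaled_variance_of_mean(T, phi, sigma2):
    """T * E(Ybar_T - mu)^2 computed from the weighted sum in [7.2.5]."""
    total = ar1_autocovariance(0, phi, sigma2)
    for j in range(1, T):
        total += ((T - j) / T) * 2 * ar1_autocovariance(j, phi, sigma2)
    return total

phi, sigma2 = 0.5, 1.0
long_run = sigma2 / (1 - phi) ** 2  # sum of all autocovariances for the AR(1)
for T in (10, 100, 1000, 10000):
    print(T, scaled_variance_of_mean(T, phi, sigma2), long_run)
```

The gap between the two columns should shrink roughly in proportion to 1/T, in line with the weights [(T - j)/T] approaching unity.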


Recall from Chapter 3 that condition [7.2.3] is satisfied for any covariance-stationary ARMA($p$, $q$) process,

$$(1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p)Y_t = \mu + (1 + \theta_1 L + \theta_2 L^2 + \cdots + \theta_q L^q)\varepsilon_t,$$

with roots of $(1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_p z^p) = 0$ outside the unit circle.

Alternative expressions for the variance in result (b) of Proposition 7.5 are sometimes used. Recall that the autocovariance-generating function for $Y_t$ is defined as

$$g_Y(z) = \sum_{j=-\infty}^{\infty} \gamma_j z^j,$$

while the spectrum is given by

$$s_Y(\omega) = \frac{1}{2\pi}\,g_Y(e^{-i\omega}).$$




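Thus, result (b) could equivalently be described as the autocovariance-generating function evaluated at $z = 1$,

$$\sum_{j=-\infty}^{\infty} \gamma_j = g_Y(1),$$

or as $2\pi$ times the spectrum at frequency $\omega = 0$,

$$\sum_{j=-\infty}^{\infty} \gamma_j = 2\pi s_Y(0),$$

the last result coming from the fact that $e^0 = 1$. For example, consider the MA($\infty$) process

$$Y_t = \mu + \sum_{j=0}^{\infty} \psi_j\varepsilon_{t-j} \equiv \mu + \psi(L)\varepsilon_t$$

with $E(\varepsilon_t\varepsilon_\tau) = \sigma^2$ if $t = \tau$ and 0 otherwise and with $\sum_{j=0}^{\infty} |\psi_j| < \infty$. Recall that its autocovariance-generating function is given by

$$g_Y(z) = \psi(z)\sigma^2\psi(z^{-1}).$$

Evaluating this at $z = 1$,

$$\sum_{j=-\infty}^{\infty} \gamma_j = \psi(1)\sigma^2\psi(1) = \sigma^2[1 + \psi_1 + \psi_2 + \psi_3 + \cdots]^2. \tag{7.2.8}$$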
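Expression [7.2.8] can be checked term by term for a finite-order moving average. The following sketch is illustrative; it assumes a finite MA(q) with ψ_0 = 1, uses the standard formula for its autocovariances (σ² times the sum over i of ψ_i ψ_{i+j}), and the coefficient values are made up.

```python
def ma_autocovariances(psi, sigma2):
    """gamma_0..gamma_q for Y_t = mu + sum_j psi_j * eps_{t-j}, psi = [1, psi_1, ..., psi_q]."""
    q = len(psi) - 1
    return [sigma2 * sum(psi[i] * psi[i + j] for i in range(q + 1 - j))
            for j in range(q + 1)]

# Hypothetical MA(3) coefficients and innovation variance.
psi = [1.0, 0.6, -0.3, 0.2]
sigma2 = 2.0
gammas = ma_autocovariances(psi, sigma2)
# Sum over j from -q to q, using gamma_{-j} = gamma_j.
sum_of_autocovariances = gammas[0] + 2 * sum(gammas[1:])
# Right-hand side of [7.2.8].
long_run_variance = sigma2 * sum(psi) ** 2
print(sum_of_autocovariances, long_run_variance)
```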


Martingale Difference Sequence 


Some very useful limit theorems pertain to martingale difference sequences. Let $\{Y_t\}_{t=1}^{\infty}$ denote a sequence of random scalars with $E(Y_t) = 0$ for all $t$.⁶ Let $\Omega_t$ denote information available at date $t$, where this information includes current and lagged values of $Y$.⁷ For example, we might have

$$\Omega_t = \{Y_t, Y_{t-1}, \ldots, Y_1, X_t, X_{t-1}, \ldots, X_1\},$$

where $X_t$ is a second random variable. If

$$E(Y_t \mid \Omega_{t-1}) = 0 \qquad \text{for } t = 2, 3, \ldots, \tag{7.2.9}$$

then $\{Y_t\}$ is said to be a martingale difference sequence with respect to $\{\Omega_t\}$.

Where no information set is specified, $\Omega_t$ is presumed to consist solely of current and lagged values of $Y$:

$$\Omega_t = \{Y_t, Y_{t-1}, \ldots, Y_1\}.$$

Thus, if a sequence of scalars $\{Y_t\}_{t=1}^{\infty}$ satisfies $E(Y_t) = 0$ for all $t$ and

$$E(Y_t \mid Y_{t-1}, Y_{t-2}, \ldots, Y_1) = 0, \tag{7.2.10}$$

for $t = 2, 3, \ldots$, then we will say simply that $\{Y_t\}$ is a martingale difference sequence. Note that [7.2.10] is implied by [7.2.9] by the law of iterated expectations.

A sequence of $(n \times 1)$ vectors $\{\mathbf{Y}_t\}_{t=1}^{\infty}$ satisfying $E(\mathbf{Y}_t) = \mathbf{0}$ and $E(\mathbf{Y}_t \mid \mathbf{Y}_{t-1}, \mathbf{Y}_{t-2}, \ldots, \mathbf{Y}_1) = \mathbf{0}$ is said to form a vector martingale difference sequence.


⁶Wherever an expectation is indicated, it is taken as implicit that the integral exists, that is, that $E|Y_t|$ is finite.

⁷More formally, $\{\Omega_t\}_{t=1}^{\infty}$ denotes an increasing sequence of $\sigma$-fields ($\Omega_{t-1} \subset \Omega_t$) with $Y_t$ measurable with respect to $\Omega_t$. See, for example, White (1984, p. 56).




Note that condition [7.2.10] is stronger than the condition that $Y_t$ is serially uncorrelated. A serially uncorrelated sequence cannot be forecast on the basis of a linear function of its past values. No function of past values, linear or nonlinear, can forecast a martingale difference sequence. While stronger than absence of serial correlation, the martingale difference condition is weaker than independence, since it does not rule out the possibility that higher moments such as $E(Y_t^2 \mid Y_{t-1}, Y_{t-2}, \ldots, Y_1)$ might depend on past $Y$'s.

Example 7.8

If $\varepsilon_t \sim$ i.i.d. $N(0, \sigma^2)$, then $Y_t = \varepsilon_t\varepsilon_{t-1}$ is a martingale difference sequence but not serially independent.
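The distinction drawn in Example 7.8 can be seen in a seeded simulation. This is a sketch under stated assumptions: it sets σ² = 1, draws 50,000 observations, and uses hypothetical helper names. For unit-variance Gaussian innovations the lag-1 correlation of the levels is zero in population, while that of the squares works out to 0.25, so the squares remain predictable from the past.

```python
import random

def simulate_products(n, seed=0):
    """Generate Y_t = eps_t * eps_{t-1} with eps_t ~ i.i.d. N(0, 1)."""
    rng = random.Random(seed)
    eps = [rng.gauss(0.0, 1.0) for _ in range(n + 1)]
    return [eps[t] * eps[t - 1] for t in range(1, n + 1)]

def lag1_correlation(x):
    """Sample first-order autocorrelation of the sequence x."""
    n = len(x)
    mean = sum(x) / n
    num = sum((x[t] - mean) * (x[t - 1] - mean) for t in range(1, n))
    den = sum((v - mean) ** 2 for v in x)
    return num / den

y = simulate_products(50000)
corr_levels = lag1_correlation(y)                     # close to zero
corr_squares = lag1_correlation([v * v for v in y])   # clearly positive
print(corr_levels, corr_squares)
```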


$L^1$-Mixingales

A more general class of processes known as $L^1$-mixingales was introduced by Andrews (1988). Consider a sequence of random variables $\{Y_t\}_{t=1}^{\infty}$ with $E(Y_t) = 0$ for $t = 1, 2, \ldots$. Let $\Omega_t$ denote information available at time $t$, as before, where $\Omega_t$ includes current and lagged values of $Y$. Suppose that we can find sequences of nonnegative deterministic constants $\{c_t\}_{t=1}^{\infty}$ and $\{\xi_m\}_{m=0}^{\infty}$ such that $\lim_{m\to\infty} \xi_m = 0$ and

$$E\,|E(Y_t \mid \Omega_{t-m})| \leq c_t\xi_m \tag{7.2.11}$$

for all $t \geq 1$ and all $m \geq 0$. Then $\{Y_t\}$ is said to follow an $L^1$-mixingale with respect to $\{\Omega_t\}$.

Thus, a zero-mean process for which the $m$-period-ahead forecast $E(Y_t \mid \Omega_{t-m})$ converges (in absolute expected value) to the unconditional mean of zero is described as an $L^1$-mixingale.


Example 7.9

Let $\{Y_t\}$ be a martingale difference sequence. Let $c_t = E|Y_t|$, and choose $\xi_0 = 1$ and $\xi_m = 0$ for $m = 1, 2, \ldots$. Then [7.2.11] is satisfied for $\Omega_t = \{Y_t, Y_{t-1}, \ldots, Y_1\}$, so that $\{Y_t\}$ could be described as an $L^1$-mixingale sequence.


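Example 7.10

Let $Y_t = \sum_{j=0}^{\infty} \psi_j\varepsilon_{t-j}$, where $\sum_{j=0}^{\infty} |\psi_j| < \infty$ and $\{\varepsilon_t\}$ is a martingale difference sequence with $E|\varepsilon_t| < M$ for all $t$ for some $M < \infty$. Then $\{Y_t\}$ is an $L^1$-mixingale with respect to $\Omega_t = \{\varepsilon_t, \varepsilon_{t-1}, \ldots\}$. To see this, notice that

$$E\,|E(Y_t \mid \varepsilon_{t-m}, \varepsilon_{t-m-1}, \ldots)| = E\left|\sum_{j=m}^{\infty} \psi_j\varepsilon_{t-j}\right| \leq E\left\{\sum_{j=m}^{\infty} |\psi_j\varepsilon_{t-j}|\right\}.$$

Since $\{\psi_j\}_{j=0}^{\infty}$ is absolutely summable and $E|\varepsilon_{t-j}| < M$, we can interchange the order of expectation and summation:

$$E\left\{\sum_{j=m}^{\infty} |\psi_j\varepsilon_{t-j}|\right\} = \sum_{j=m}^{\infty} |\psi_j|\cdot E|\varepsilon_{t-j}| \leq \sum_{j=m}^{\infty} |\psi_j|\cdot M.$$

Then [7.2.11] is satisfied with $c_t = M$ and $\xi_m = \sum_{j=m}^{\infty} |\psi_j|$. Moreover, $\lim_{m\to\infty} \xi_m = 0$, because of absolute summability of $\{\psi_j\}_{j=0}^{\infty}$. Hence, $\{Y_t\}$ is an $L^1$-mixingale.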




Law of Large Numbers for $L^1$-Mixingales

Andrews (1988) derived the following law of large numbers for $L^1$-mixingales.⁸

Proposition 7.6: Let $\{Y_t\}$ be an $L^1$-mixingale. If (a) $\{Y_t\}$ is uniformly integrable and (b) there exists a choice for $\{c_t\}$ such that

$$\lim_{T\to\infty} (1/T) \sum_{t=1}^{T} c_t < \infty,$$

then $(1/T)\sum_{t=1}^{T} Y_t \xrightarrow{p} 0$.


To apply this result, we need to verify that a sequence is uniformly integrable. A sequence $\{Y_t\}$ is said to be uniformly integrable if for every $\varepsilon > 0$ there exists a number $c > 0$ such that

$$E(|Y_t|\cdot\delta_{[|Y_t| \geq c]}) < \varepsilon \tag{7.2.12}$$

for all $t$, where $\delta_{[|Y_t| \geq c]} = 1$ if $|Y_t| \geq c$ and 0 otherwise. The following proposition gives sufficient conditions for uniform integrability.

Proposition 7.7: (a) Suppose there exist an $r > 1$ and an $M' < \infty$ such that $E(|Y_t|^r) < M'$ for all $t$. Then $\{Y_t\}$ is uniformly integrable. (b) Suppose there exist an $r > 1$ and an $M' < \infty$ such that $E(|X_t|^r) < M'$ for all $t$. If $Y_t = \sum_{j=-\infty}^{\infty} h_j X_{t-j}$ with $\sum_{j=-\infty}^{\infty} |h_j| < \infty$, then $\{Y_t\}$ is uniformly integrable.

Condition (a) requires us to find a moment higher than the first that exists. Typically, we would use $r = 2$. However, even if a variable has infinite variance, it can still be uniformly integrable as long as $E|Y_t|^r$ exists for some $r$ between 1 and 2.


Example 7.11

Let $\bar{Y}_T$ be the sample mean from a martingale difference sequence, $\bar{Y}_T = (1/T)\sum_{t=1}^{T} Y_t$, with $E|Y_t|^r < M'$ for some $r > 1$ and $M' < \infty$. Note that this also implies that there exists an $M < \infty$ such that $E|Y_t| < M$. From Proposition 7.7(a), $\{Y_t\}$ is uniformly integrable. Moreover, from Example 7.9, $\{Y_t\}$ can be viewed as an $L^1$-mixingale with $c_t = M$. Thus, $\lim_{T\to\infty} (1/T)\sum_{t=1}^{T} c_t = M < \infty$, and so, from Proposition 7.6, $\bar{Y}_T \xrightarrow{p} 0$.

Example 7.12

Let $Y_t = \sum_{j=0}^{\infty} \psi_j\varepsilon_{t-j}$, where $\sum_{j=0}^{\infty} |\psi_j| < \infty$ and $\{\varepsilon_t\}$ is a martingale difference sequence with $E|\varepsilon_t|^r < M'$ for some $r > 1$ and some $M' < \infty$. Then, from Proposition 7.7(b), $\{Y_t\}$ is uniformly integrable. Moreover, from Example 7.10, $\{Y_t\}$ is an $L^1$-mixingale with $c_t = M$, where $M$ represents the largest value of $E|\varepsilon_t|$ for any $t$. Then $\lim_{T\to\infty} (1/T)\sum_{t=1}^{T} c_t = M < \infty$, establishing again that $\bar{Y}_T \xrightarrow{p} 0$.


Proposition 7.6 can also be applied to a double-indexed array $\{Y_{t,T}\}$; that is, each sample size $T$ can be associated with a different sequence $\{Y_{1,T}, Y_{2,T}, \ldots, Y_{T,T}\}$. The array is said to be an $L^1$-mixingale with respect to an information set $\Omega_{t,T}$ that includes $\{Y_{1,T}, Y_{2,T}, \ldots, Y_{t,T}\}$ if there exist nonnegative constants $\xi_m$ and $c_{t,T}$ such that $\lim_{m\to\infty} \xi_m = 0$ and

$$E\,|E(Y_{t,T} \mid \Omega_{t-m,T})| \leq c_{t,T}\,\xi_m$$

for all $m \geq 0$, $T \geq 1$, and $t = 1, 2, \ldots, T$. If the array is uniformly integrable with $\lim_{T\to\infty} (1/T)\sum_{t=1}^{T} c_{t,T} < \infty$, then $(1/T)\sum_{t=1}^{T} Y_{t,T} \xrightarrow{p} 0$.

⁸Andrews replaced part (b) of the proposition with the weaker condition $\overline{\lim}_{T\to\infty}\, (1/T)\sum_{t=1}^{T} c_t < \infty$. See Royden (1968, p. 36) on the relation between "lim" and "$\overline{\lim}$."

Example 7.13

Let $\{\varepsilon_t\}_{t=1}^{\infty}$ be a martingale difference sequence with $E|\varepsilon_t|^r < M'$ for some $r > 1$ and $M' < \infty$, and define $Y_{t,T} = (t/T)\varepsilon_t$. Then the array $\{Y_{t,T}\}$ is a uniformly integrable $L^1$-mixingale with $c_{t,T} = M$, where $M$ denotes the maximal value for $E|\varepsilon_t|$, $\xi_0 = 1$, and $\xi_m = 0$ for $m > 0$. Hence, $(1/T)\sum_{t=1}^{T} (t/T)\varepsilon_t \xrightarrow{p} 0$.
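For intuition about Example 7.13, one can compute the exact variance of the weighted average in the special case where the innovations also have a common finite variance σ². That is an extra assumption: the example itself only requires a moment of order r > 1. Because a martingale difference sequence is serially uncorrelated, the variance is then σ² times the sum of (t/T)² divided by T², which shrinks to zero; by the Chebyshev argument given earlier, this implies convergence in probability. The function name below is hypothetical.

```python
from fractions import Fraction

def weighted_mean_variance(T, sigma2=Fraction(1)):
    """Variance of (1/T) * sum_{t=1}^{T} (t/T) * eps_t for uncorrelated eps_t
    with common variance sigma2 (an assumption added for this sketch)."""
    return sigma2 * sum(Fraction(t, T) ** 2 for t in range(1, T + 1)) / T ** 2

for T in (1, 2, 10, 100, 1000):
    # Printed as a float for readability; the value behaves like sigma2 / (3T).
    print(T, float(weighted_mean_variance(T)))
```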


Consistent Estimation of Second Moments 


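Next consider the conditions under which

$$(1/T) \sum_{t=1}^{T} Y_tY_{t-k} \xrightarrow{p} E(Y_tY_{t-k})$$

(for notational simplicity, we assume here that the sample consists of $T + k$ observations on $Y$). Suppose that $Y_t = \sum_{j=0}^{\infty} \psi_j\varepsilon_{t-j}$, where $\sum_{j=0}^{\infty} |\psi_j| < \infty$ and $\{\varepsilon_t\}$ is an i.i.d. sequence with $E|\varepsilon_t|^r < \infty$ for some $r > 2$. Note that the population second moment can be written⁹

$$\begin{aligned}
E(Y_tY_{t-k}) &= E\left(\sum_{u=0}^{\infty} \psi_u\varepsilon_{t-u}\right)\left(\sum_{v=0}^{\infty} \psi_v\varepsilon_{t-k-v}\right) \\
&= E\left(\sum_{u=0}^{\infty}\sum_{v=0}^{\infty} \psi_u\psi_v\varepsilon_{t-u}\varepsilon_{t-k-v}\right) \\
&= \sum_{u=0}^{\infty}\sum_{v=0}^{\infty} \psi_u\psi_v\,E(\varepsilon_{t-u}\varepsilon_{t-k-v}).
\end{aligned} \tag{7.2.13}$$

Define $X_{tk}$ to be the following random variable:

$$\begin{aligned}
X_{tk} &\equiv Y_tY_{t-k} - E(Y_tY_{t-k}) \\
&= \left(\sum_{u=0}^{\infty}\sum_{v=0}^{\infty} \psi_u\psi_v\varepsilon_{t-u}\varepsilon_{t-k-v}\right) - \left(\sum_{u=0}^{\infty}\sum_{v=0}^{\infty} \psi_u\psi_v\,E(\varepsilon_{t-u}\varepsilon_{t-k-v})\right) \\
&= \sum_{u=0}^{\infty}\sum_{v=0}^{\infty} \psi_u\psi_v\,[\varepsilon_{t-u}\varepsilon_{t-k-v} - E(\varepsilon_{t-u}\varepsilon_{t-k-v})].
\end{aligned}$$

Consider a forecast of $X_{tk}$ on the basis of $\Omega_{t-m} \equiv \{\varepsilon_{t-m}, \varepsilon_{t-m-1}, \ldots\}$ for $m > k$:

$$E(X_{tk} \mid \Omega_{t-m}) = \sum_{u=m}^{\infty}\sum_{v=m-k}^{\infty} \psi_u\psi_v\,[\varepsilon_{t-u}\varepsilon_{t-k-v} - E(\varepsilon_{t-u}\varepsilon_{t-k-v})].$$

⁹Notice that

$$\sum_{u=0}^{\infty}\sum_{v=0}^{\infty} |\psi_u\psi_v| = \sum_{u=0}^{\infty} |\psi_u| \sum_{v=0}^{\infty} |\psi_v| < \infty$$

and $E|\varepsilon_{t-u}\varepsilon_{t-k-v}| < \infty$, permitting us to move the expectation operator inside the summation signs in the last line of [7.2.13].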

192 Chapter 7 | Asymptotic Distribution Theory 


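The expected absolute value of this forecast is bounded by

$$\begin{aligned}
E\left|E(X_{tk} \mid \Omega_{t-m})\right| &= E\left|\sum_{u=m}^{\infty}\sum_{v=m-k}^{\infty} \psi_u \psi_v \left[\varepsilon_{t-u}\varepsilon_{t-k-v} - E(\varepsilon_{t-u}\varepsilon_{t-k-v})\right]\right| \\
&\le E\left(\sum_{u=m}^{\infty}\sum_{v=m-k}^{\infty} |\psi_u \psi_v| \cdot \left|\varepsilon_{t-u}\varepsilon_{t-k-v} - E(\varepsilon_{t-u}\varepsilon_{t-k-v})\right|\right) \\
&\le \sum_{u=m}^{\infty}\sum_{v=m-k}^{\infty} |\psi_u \psi_v|\, M
\end{aligned}$$

for some $M < \infty$. Define

$$\xi_m \equiv \sum_{u=m}^{\infty}\sum_{v=m-k}^{\infty} |\psi_u \psi_v| = \sum_{u=m}^{\infty} |\psi_u| \sum_{v=m-k}^{\infty} |\psi_v|.$$

Since $\{\psi_j\}_{j=0}^{\infty}$ is absolutely summable, $\lim_{m\to\infty} \sum_{u=m}^{\infty} |\psi_u| = 0$ and $\lim_{m\to\infty} \xi_m = 0$. It follows that $X_{tk}$ is an $L^1$-mixingale with respect to $\Omega_t$ with coefficient $c_t = M$. Moreover, $X_{tk}$ is uniformly integrable, from a simple adaptation of the argument in Proposition 7.7(b) (see Exercise 7.5). Hence,

$$(1/T)\sum_{t=1}^{T} X_{tk} = (1/T)\sum_{t=1}^{T} \left[Y_t Y_{t-k} - E(Y_t Y_{t-k})\right] \xrightarrow{p} 0,$$

from which

$$(1/T)\sum_{t=1}^{T} Y_t Y_{t-k} \xrightarrow{p} E(Y_t Y_{t-k}). \qquad [7.2.14]$$

It is straightforward to deduce from [7.2.14] that the $k$th sample autocovariance for a sample of size $T$ gives a consistent estimate of the population autocovariance,

$$(1/T)\sum_{t=1}^{T} (Y_t - \bar{Y}_T)(Y_{t-k} - \bar{Y}_T) \xrightarrow{p} E(Y_t - \mu)(Y_{t-k} - \mu), \qquad [7.2.15]$$

where $\bar{Y}_T = (1/T)\sum_{t=1}^{T} Y_t$; see Exercise 7.6.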
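A small simulation can make result [7.2.15] concrete. The sketch below rests on assumptions that are not in the text: the weights are $\psi_j = 0.5^j$ truncated at 60 terms, the innovations are i.i.d. $N(0, 1)$, and the function names are hypothetical. It compares sample autocovariances with the exact autocovariances of the truncated process, $\gamma_k = \sum_j \psi_j \psi_{j+k}$.

```python
import numpy as np

def simulate_linear_process(psi, T, rng):
    """Simulate Y_t = sum_j psi_j * eps_{t-j} with i.i.d. N(0,1) innovations (truncated MA)."""
    J = len(psi)
    eps = rng.standard_normal(T + J - 1)
    # 'valid' keeps only the dates for which all J lagged innovations are available.
    return np.convolve(eps, psi, mode="valid")

def sample_autocov(y, k):
    """k-th sample autocovariance, dividing by the sample size as in [7.2.15]."""
    T = len(y)
    d = y - y.mean()
    return np.sum(d[k:] * d[:T - k]) / T

def population_autocov(psi, k):
    """gamma_k = sum_j psi_j * psi_{j+k} when the innovation variance is one."""
    return np.sum(psi[:len(psi) - k] * psi[k:])

rng = np.random.default_rng(0)
psi = 0.5 ** np.arange(60)                 # assumed absolutely summable weights
for T in (100, 10_000, 200_000):
    y = simulate_linear_process(psi, T, rng)
    print(T, [round(sample_autocov(y, k) - population_autocov(psi, k), 4) for k in range(3)])
```

The printed estimation errors should tend toward zero as $T$ grows, in line with consistency of the sample autocovariances.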


Central Limit Theorem for a Martingale 
Difference Sequence 


Next we consider the asymptotic distribution of $\sqrt{T}$ times the sample mean. The following version of the central limit theorem can often be applied.


Proposition 7.8: (White, 1984, Corollary 5.25, p. 130). Let $\{Y_t\}_{t=1}^{\infty}$ be a scalar martingale difference sequence with $\bar{Y}_T = (1/T)\sum_{t=1}^{T} Y_t$. Suppose that (a) $E(Y_t^2) = \sigma_t^2 > 0$ with $(1/T)\sum_{t=1}^{T} \sigma_t^2 \to \sigma^2 > 0$, (b) $E|Y_t|^r < \infty$ for some $r > 2$ and all $t$, and (c) $(1/T)\sum_{t=1}^{T} Y_t^2 \xrightarrow{p} \sigma^2$. Then $\sqrt{T}\,\bar{Y}_T \xrightarrow{L} N(0, \sigma^2)$.


Again, Proposition 7.8 can be extended to arrays $\{Y_{t,T}\}$ as follows. Let $\{Y_{t,T}\}_{t=1}^{T}$ be a martingale difference sequence with $E(Y_{t,T}^2) = \sigma_{t,T}^2 > 0$. Let $\{Y_{t,T+1}\}_{t=1}^{T+1}$ be a potentially different martingale difference sequence with $E(Y_{t,T+1}^2) = \sigma_{t,T+1}^2 > 0$. If (a) $(1/T)\sum_{t=1}^{T} \sigma_{t,T}^2 \to \sigma^2$, (b) $E|Y_{t,T}|^r < \infty$ for some $r > 2$ and all $t$ and $T$, and (c) $(1/T)\sum_{t=1}^{T} Y_{t,T}^2 \xrightarrow{p} \sigma^2$, then $\sqrt{T}\,\bar{Y}_T \xrightarrow{L} N(0, \sigma^2)$.

Proposition 7.8 also readily generalizes to vector martingale difference sequences.




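Proposition 7.9: Let $\{\mathbf{Y}_t\}_{t=1}^{\infty}$ be an $n$-dimensional vector martingale difference sequence with $\bar{\mathbf{Y}}_T = (1/T)\sum_{t=1}^{T} \mathbf{Y}_t$. Suppose that (a) $E(\mathbf{Y}_t \mathbf{Y}_t') = \boldsymbol{\Omega}_t$, a positive definite matrix with $(1/T)\sum_{t=1}^{T} \boldsymbol{\Omega}_t \to \boldsymbol{\Omega}$, a positive definite matrix; (b) $E(Y_{it} Y_{jt} Y_{lt} Y_{mt}) < \infty$ for all $t$ and all $i$, $j$, $l$, and $m$ (including $i = j = l = m$), where $Y_{it}$ is the $i$th element of the vector $\mathbf{Y}_t$; and (c) $(1/T)\sum_{t=1}^{T} \mathbf{Y}_t \mathbf{Y}_t' \xrightarrow{p} \boldsymbol{\Omega}$. Then $\sqrt{T}\,\bar{\mathbf{Y}}_T \xrightarrow{L} N(\mathbf{0}, \boldsymbol{\Omega})$.


Again, Proposition 7.9 holds for arrays $\{\mathbf{Y}_{t,T}\}_{t=1}^{T}$ satisfying the stated conditions.
To apply Proposition 7.9, we will often need to assume that a certain process has finite fourth moments. The following result can be useful for this purpose.


Proposition 7.10: Let $X_t$ be a strictly stationary stochastic process with $E(X_t^4) = \mu_4 < \infty$. Let $Y_t = \sum_{j=0}^{\infty} h_j X_{t-j}$, where $\sum_{j=0}^{\infty} |h_j| < \infty$. Then $Y_t$ is a strictly stationary stochastic process with $E|Y_t Y_s Y_u Y_v| < \infty$ for all $t$, $s$, $u$, and $v$.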


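Example 7.14

Let $Y_t = \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \cdots + \phi_p Y_{t-p} + \varepsilon_t$, where $\{\varepsilon_t\}$ is an i.i.d. sequence and where roots of $(1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_p z^p) = 0$ lie outside the unit circle. We saw in Chapter 3 that $Y_t$ can be written as $\sum_{j=0}^{\infty} \psi_j \varepsilon_{t-j}$ with $\sum_{j=0}^{\infty} |\psi_j| < \infty$. Proposition 7.10 states that if $\varepsilon_t$ has finite fourth moments, then so does $Y_t$.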


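Example 7.15

Let $Y_t = \sum_{j=0}^{\infty} \psi_j \varepsilon_{t-j}$ with $\sum_{j=0}^{\infty} |\psi_j| < \infty$ and $\varepsilon_t$ i.i.d. with $E(\varepsilon_t) = 0$, $E(\varepsilon_t^2) = \sigma^2$, and $E(\varepsilon_t^4) < \infty$. Consider the random variable $X_t$ defined by $X_t \equiv \varepsilon_t Y_{t-k}$ for $k > 0$. Then $X_t$ is a martingale difference sequence with variance $E(X_t^2) = \sigma^2 \cdot E(Y_t^2)$ and with fourth moment $E(\varepsilon_t^4) \cdot E(Y_t^4) < \infty$, by Example 7.14. Hence, if we can show that

$$(1/T)\sum_{t=1}^{T} X_t^2 \xrightarrow{p} E(X_t^2), \qquad [7.2.16]$$

then Proposition 7.8 can be applied to deduce that

$$(1/\sqrt{T})\sum_{t=1}^{T} X_t \xrightarrow{L} N\left(0, E(X_t^2)\right)$$

or

$$(1/\sqrt{T})\sum_{t=1}^{T} \varepsilon_t Y_{t-k} \xrightarrow{L} N\left(0, \sigma^2 \cdot E(Y_t^2)\right). \qquad [7.2.17]$$

To verify [7.2.16], notice that

$$\begin{aligned}
(1/T)\sum_{t=1}^{T} X_t^2 &= (1/T)\sum_{t=1}^{T} \varepsilon_t^2 Y_{t-k}^2 \\
&= (1/T)\sum_{t=1}^{T} (\varepsilon_t^2 - \sigma^2) Y_{t-k}^2 + (1/T)\sum_{t=1}^{T} \sigma^2 Y_{t-k}^2. \qquad [7.2.18]
\end{aligned}$$

But $(\varepsilon_t^2 - \sigma^2) Y_{t-k}^2$ is a martingale difference sequence with finite second moment, so, from Example 7.11,

$$(1/T)\sum_{t=1}^{T} (\varepsilon_t^2 - \sigma^2) Y_{t-k}^2 \xrightarrow{p} 0.$$

It further follows from result [7.2.14] that

$$(1/T)\sum_{t=1}^{T} \sigma^2 Y_{t-k}^2 \xrightarrow{p} \sigma^2 \cdot E(Y_t^2).$$

Thus, [7.2.18] implies

$$(1/T)\sum_{t=1}^{T} X_t^2 \xrightarrow{p} \sigma^2 \cdot E(Y_t^2),$$

as claimed in [7.2.16].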
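Result [7.2.17] can be checked by Monte Carlo. The sketch below is an illustration under added assumptions: $Y_t$ follows an AR(1) with coefficient 0.5 and i.i.d. $N(0, 1)$ innovations, so that $\sigma^2 \cdot E(Y_t^2) = 1/(1 - 0.25)$; the function name `scaled_sums` and the burn-in length are hypothetical choices.

```python
import numpy as np

def scaled_sums(phi, T, k, reps, rng, burn=100):
    """Draw `reps` realizations of (1/sqrt(T)) * sum_t eps_t * Y_{t-k} for an AR(1) Y."""
    eps = rng.standard_normal((reps, T + burn))
    y = np.zeros_like(eps)
    for t in range(1, T + burn):
        y[:, t] = phi * y[:, t - 1] + eps[:, t]
    # X_t = eps_t * Y_{t-k} over the last T dates (requires k <= burn).
    x = eps[:, burn:] * y[:, burn - k: T + burn - k]
    return x.sum(axis=1) / np.sqrt(T)

rng = np.random.default_rng(0)
phi = 0.5
stats = scaled_sums(phi, T=400, k=1, reps=4000, rng=rng)
print("simulated variance:", stats.var())
print("sigma^2 * E(Y^2):  ", 1.0 / (1.0 - phi**2))
```

The simulated variance should be close to the theoretical value; a histogram of `stats` could be compared with the Normal density as a further check.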


Central Limit Theorem for Stationary Stochastic Processes 


We now present a central limit theorem for a serially correlated sequence. Recall from Proposition 7.5 that the sample mean has asymptotic variance given by $(1/T)\sum_{j=-\infty}^{\infty} \gamma_j$. Thus, we would expect the central limit theorem to take the form $\sqrt{T}(\bar{Y}_T - \mu) \xrightarrow{L} N(0, \sum_{j=-\infty}^{\infty} \gamma_j)$. The next proposition gives a result of this type.


Proposition 7.11: (Anderson, 1971, p. 429). Let

$$Y_t = \mu + \sum_{j=0}^{\infty} \psi_j \varepsilon_{t-j},$$

where $\{\varepsilon_t\}$ is a sequence of i.i.d. random variables with $E(\varepsilon_t^2) < \infty$ and $\sum_{j=0}^{\infty} |\psi_j| < \infty$. Then

$$\sqrt{T}(\bar{Y}_T - \mu) \xrightarrow{L} N\left(0, \sum_{j=-\infty}^{\infty} \gamma_j\right). \qquad [7.2.19]$$


A version of [7.2.19] can also be developed for $\{\varepsilon_t\}$ a martingale difference sequence satisfying certain restrictions; see Phillips and Solo (1992).
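As a numerical illustration of [7.2.19], the sketch below simulates an AR(1) process with coefficient 0.5, mean 3, and unit innovation variance (all assumed values, as are the function name and burn-in length). For this process $\sum_j \gamma_j = \sigma^2 (\sum_j \psi_j)^2 = 1/(1 - 0.5)^2 = 4$, and the simulated variance of $\sqrt{T}(\bar{Y}_T - \mu)$ can be compared with that number.

```python
import numpy as np

def root_T_means(phi, mu, T, reps, rng, burn=100):
    """Draw `reps` realizations of sqrt(T) * (Ybar_T - mu) for an AR(1) around mean mu."""
    eps = rng.standard_normal((reps, T + burn))
    dev = np.zeros_like(eps)                 # dev_t = Y_t - mu
    for t in range(1, T + burn):
        dev[:, t] = phi * dev[:, t - 1] + eps[:, t]
    y = mu + dev[:, burn:]                   # discard the burn-in dates
    return np.sqrt(T) * (y.mean(axis=1) - mu)

rng = np.random.default_rng(0)
phi, mu = 0.5, 3.0
long_run_variance = 1.0 / (1.0 - phi) ** 2   # sum of all autocovariances when sigma^2 = 1
z = root_T_means(phi, mu, T=1000, reps=2000, rng=rng)
print("simulated variance of sqrt(T)*(Ybar - mu):", z.var())
print("sum of autocovariances:                   ", long_run_variance)
```

Note that the limiting variance is the sum of all autocovariances rather than $\gamma_0$ alone; for this design $\gamma_0 = 4/3$, so ignoring serial correlation would understate the variance by a factor of three.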


APPENDIX 7.A. Proofs of Chapter 7 Propositions 


■ Proof of Proposition 7.1. Let $g_j(\mathbf{c})$ denote the $j$th element of $\mathbf{g}(\mathbf{c})$, $g_j: \mathbb{R}^n \to \mathbb{R}^1$. We need to show that for any $\delta > 0$ and $\varepsilon > 0$ there exists an $N$ such that for all $T \ge N$,

$$P\{|g_j(\mathbf{X}_T) - g_j(\mathbf{c})| > \delta\} < \varepsilon. \qquad [7.A.1]$$

Continuity of $g_j(\cdot)$ implies that there exists an $\eta$ such that $|g_j(\mathbf{X}_T) - g_j(\mathbf{c})| > \delta$ only if

$$[(X_{1T} - c_1)^2 + (X_{2T} - c_2)^2 + \cdots + (X_{nT} - c_n)^2] > \eta^2. \qquad [7.A.2]$$

This would be the case only if $(X_{iT} - c_i)^2 > \eta^2/n$ for some $i$. But from the fact that plim $X_{iT} = c_i$, for any $i$ and specified values of $\varepsilon$ and $\eta$ we can find a value of $N$ such that

$$P\{|X_{iT} - c_i| > \eta/\sqrt{n}\} < \varepsilon/n$$

for all $T \ge N$. Recall the elementary addition rule for the probability of any events $A$ and $B$,

$$P\{A \text{ or } B\} \le P\{A\} + P\{B\},$$

from which it follows that

$$P\{(|X_{1T} - c_1| > \eta/\sqrt{n}) \text{ or } (|X_{2T} - c_2| > \eta/\sqrt{n}) \text{ or } \cdots \text{ or } (|X_{nT} - c_n| > \eta/\sqrt{n})\} < (\varepsilon/n) + (\varepsilon/n) + \cdots + (\varepsilon/n).$$

Hence,

$$P\{[(X_{1T} - c_1)^2 + (X_{2T} - c_2)^2 + \cdots + (X_{nT} - c_n)^2] > \eta^2\} < \varepsilon$$

for all $T \ge N$. Since [7.A.2] was a necessary condition for $|g_j(\mathbf{X}_T) - g_j(\mathbf{c})|$ to be greater than $\delta$, it follows that the probability that $|g_j(\mathbf{X}_T) - g_j(\mathbf{c})|$ is greater than $\delta$ is less than $\varepsilon$, which was to be shown. ■


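■ Proof of Proposition 7.2. Let $S$ denote the set of all $x$ such that $|x - c| > \delta$, and let $\bar{S}$ denote its complement (all $x$ such that $|x - c| \le \delta$). Then, for $f_X(x)$ the density of $x$,

$$\begin{aligned}
E|X - c|^r &= \int |x - c|^r f_X(x)\,dx \\
&= \int_S |x - c|^r f_X(x)\,dx + \int_{\bar{S}} |x - c|^r f_X(x)\,dx \\
&\ge \int_S |x - c|^r f_X(x)\,dx \\
&\ge \int_S \delta^r f_X(x)\,dx \\
&= \delta^r P\{|X - c| > \delta\},
\end{aligned}$$

so that

$$E|X - c|^r \ge \delta^r P\{|X - c| > \delta\},$$

as claimed. ■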
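■ Proof of Proposition 7.4. Consider any real $(m \times 1)$ vector $\boldsymbol{\lambda}$, and form the function $h: \mathbb{R}^n \to \mathbb{R}^1$ defined by $h(\mathbf{x}) \equiv \boldsymbol{\lambda}'\mathbf{g}(\mathbf{x})$, noting that $h(\cdot)$ is differentiable. The mean-value theorem states that for a differentiable function $h(\cdot)$, there exists an $(n \times 1)$ vector $\mathbf{c}_T$ between $\mathbf{X}_T$ and $\mathbf{c}$ such that¹⁰

$$h(\mathbf{X}_T) - h(\mathbf{c}) = \left.\frac{\partial h(\mathbf{x})}{\partial \mathbf{x}'}\right|_{\mathbf{x}=\mathbf{c}_T} \times (\mathbf{X}_T - \mathbf{c})$$

and therefore

$$\sqrt{T}\,[h(\mathbf{X}_T) - h(\mathbf{c})] = \left.\frac{\partial h(\mathbf{x})}{\partial \mathbf{x}'}\right|_{\mathbf{x}=\mathbf{c}_T} \times \sqrt{T}(\mathbf{X}_T - \mathbf{c}). \qquad [7.A.3]$$

Since $\mathbf{c}_T$ is between $\mathbf{X}_T$ and $\mathbf{c}$ and since $\mathbf{X}_T \xrightarrow{p} \mathbf{c}$, we know that $\mathbf{c}_T \xrightarrow{p} \mathbf{c}$. Moreover, the derivative $\partial h(\mathbf{x})/\partial \mathbf{x}'$ is itself a continuous function of $\mathbf{x}$. Thus, from Proposition 7.1,

$$\left.\frac{\partial h(\mathbf{x})}{\partial \mathbf{x}'}\right|_{\mathbf{x}=\mathbf{c}_T} \xrightarrow{p} \left.\frac{\partial h(\mathbf{x})}{\partial \mathbf{x}'}\right|_{\mathbf{x}=\mathbf{c}}.$$

Given that $\sqrt{T}(\mathbf{X}_T - \mathbf{c}) \xrightarrow{L} \mathbf{X}$, Proposition 7.3(b) applied to expression [7.A.3] gives

$$\sqrt{T}\,[h(\mathbf{X}_T) - h(\mathbf{c})] \xrightarrow{L} \left.\frac{\partial h(\mathbf{x})}{\partial \mathbf{x}'}\right|_{\mathbf{x}=\mathbf{c}} \mathbf{X},$$

or, in terms of the original function $\mathbf{g}(\cdot)$,

$$\boldsymbol{\lambda}'\{\sqrt{T}\,[\mathbf{g}(\mathbf{X}_T) - \mathbf{g}(\mathbf{c})]\} \xrightarrow{L} \boldsymbol{\lambda}' \left.\frac{\partial \mathbf{g}(\mathbf{x})}{\partial \mathbf{x}'}\right|_{\mathbf{x}=\mathbf{c}} \mathbf{X}.$$

Since this is true for any $\boldsymbol{\lambda}$, we conclude that

$$\sqrt{T}\,[\mathbf{g}(\mathbf{X}_T) - \mathbf{g}(\mathbf{c})] \xrightarrow{L} \left.\frac{\partial \mathbf{g}(\mathbf{x})}{\partial \mathbf{x}'}\right|_{\mathbf{x}=\mathbf{c}} \mathbf{X},$$

as claimed. ■


¹⁰That is, for any given $\mathbf{X}_T$ there exists a scalar $\mu_T$ with $0 \le \mu_T \le 1$ such that $\mathbf{c}_T = \mu_T \mathbf{X}_T + (1 - \mu_T)\mathbf{c}$. See, for example, Marsden (1974, pp. 174-75).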




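■ Proof of Proposition 7.7. Part (a) is established as in Andrews (1988, p. 463) using Hölder's inequality (see, for example, White, 1984, p. 30), which states that for $r > 1$, if $E[|Y|^r] < \infty$ and $E[|W|^{r/(r-1)}] < \infty$, then

$$E|YW| \le \{E[|Y|^r]\}^{1/r} \times \{E[|W|^{r/(r-1)}]\}^{(r-1)/r}.$$

This implies that

$$E(|Y_t| \cdot \delta_{[|Y_t| \ge c]}) \le \{E[|Y_t|^r]\}^{1/r} \times \{E[(\delta_{[|Y_t| \ge c]})^{r/(r-1)}]\}^{(r-1)/r}. \qquad [7.A.4]$$

Since $\delta_{[|Y_t| \ge c]}$ is either 0 or 1, it follows that

$$(\delta_{[|Y_t| \ge c]})^{r/(r-1)} = \delta_{[|Y_t| \ge c]}$$

and so

$$E[(\delta_{[|Y_t| \ge c]})^{r/(r-1)}] = E[\delta_{[|Y_t| \ge c]}] = \int_{|y_t| \ge c} 1 \cdot f_{Y_t}(y_t)\,dy_t = P\{|Y_t| \ge c\} \le \frac{E|Y_t|}{c}, \qquad [7.A.5]$$

where the last result follows from Chebyshev's inequality. Substituting [7.A.5] into [7.A.4],

$$E(|Y_t| \cdot \delta_{[|Y_t| \ge c]}) \le \{E[|Y_t|^r]\}^{1/r} \times \left\{\frac{E|Y_t|}{c}\right\}^{(r-1)/r}. \qquad [7.A.6]$$

Recall that $E[|Y_t|^r] < M'$ for all $t$, implying that there also exists an $M < \infty$ such that $E|Y_t| < M$ for all $t$. Hence,

$$E(|Y_t| \cdot \delta_{[|Y_t| \ge c]}) \le (M')^{1/r} \times (M/c)^{(r-1)/r}.$$

This expression can be made as small as desired by choosing $c$ sufficiently large. Thus condition [7.2.12] holds, ensuring that $\{Y_t\}$ is uniformly integrable.

To establish (b), notice that

$$E(|Y_t| \cdot \delta_{[|Y_t| \ge c]}) = E\left|\sum_{j=-\infty}^{\infty} h_j X_{t-j} \cdot \delta_{[|Y_t| \ge c]}\right| \le E\left\{\sum_{j=-\infty}^{\infty} |h_j| \cdot |X_{t-j}| \cdot \delta_{[|Y_t| \ge c]}\right\}. \qquad [7.A.7]$$

Since $E[|X_{t-j}|^r] < M'$ and since $\delta_{[|Y_t| \ge c]} \le 1$, it follows that $E\{|X_{t-j}| \cdot \delta_{[|Y_t| \ge c]}\}$ is bounded. Since $\{h_j\}_{j=-\infty}^{\infty}$ is absolutely summable, we can bring the expectation operator inside the summation in the last expression of [7.A.7] to deduce that

$$\begin{aligned}
E\left\{\sum_{j=-\infty}^{\infty} |h_j| \cdot |X_{t-j}| \cdot \delta_{[|Y_t| \ge c]}\right\} &= \sum_{j=-\infty}^{\infty} |h_j| \cdot E\{|X_{t-j}| \cdot \delta_{[|Y_t| \ge c]}\} \\
&\le \sum_{j=-\infty}^{\infty} |h_j| \cdot \{E[|X_{t-j}|^r]\}^{1/r} \times \left\{\frac{E|Y_t|}{c}\right\}^{(r-1)/r},
\end{aligned}$$

where the last inequality follows from the same arguments as in [7.A.6]. Hence, [7.A.7] becomes

$$E(|Y_t| \cdot \delta_{[|Y_t| \ge c]}) \le \sum_{j=-\infty}^{\infty} |h_j| \times (M')^{1/r} \times \left\{\frac{E|Y_t|}{c}\right\}^{(r-1)/r}. \qquad [7.A.8]$$

But certainly, $E|Y_t|$ is bounded:

$$E|Y_t| = E\left|\sum_{j=-\infty}^{\infty} h_j X_{t-j}\right| \le \sum_{j=-\infty}^{\infty} |h_j| \cdot E|X_{t-j}| = K < \infty.$$

Thus, from [7.A.8],

$$E(|Y_t| \cdot \delta_{[|Y_t| \ge c]}) \le (M')^{1/r} (K/c)^{(r-1)/r} \sum_{j=-\infty}^{\infty} |h_j|. \qquad [7.A.9]$$

Since $\sum_{j=-\infty}^{\infty} |h_j|$ is finite, [7.A.9] can again be made as small as desired by choosing $c$ sufficiently large. ■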


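■ Proof of Proposition 7.9. Consider $Y_t \equiv \boldsymbol{\lambda}'\mathbf{Y}_t$ for $\boldsymbol{\lambda}$ any real $(n \times 1)$ vector. Then $Y_t$ is a martingale difference sequence. We next verify that each of the conditions of Proposition 7.8 is satisfied. (a) $E(Y_t^2) = \boldsymbol{\lambda}'\boldsymbol{\Omega}_t\boldsymbol{\lambda} \equiv \sigma_t^2 > 0$, by positive definiteness of $\boldsymbol{\Omega}_t$. Likewise,

$$(1/T)\sum_{t=1}^{T} \sigma_t^2 = \boldsymbol{\lambda}'\left[(1/T)\sum_{t=1}^{T} \boldsymbol{\Omega}_t\right]\boldsymbol{\lambda} \to \boldsymbol{\lambda}'\boldsymbol{\Omega}\boldsymbol{\lambda} \equiv \sigma^2,$$

with $\sigma^2 > 0$, by positive definiteness of $\boldsymbol{\Omega}$. (b) $E(Y_t^4)$ is a finite sum of terms of the form $\lambda_i \lambda_j \lambda_l \lambda_m E(Y_{it} Y_{jt} Y_{lt} Y_{mt})$ and so is bounded for all $t$ by condition (b) of Proposition 7.9; hence, $Y_t$ satisfies condition (b) of Proposition 7.8 for $r = 4$. (c) Define $S_T \equiv (1/T)\sum_{t=1}^{T} Y_t^2$ and $\mathbf{S}_T \equiv (1/T)\sum_{t=1}^{T} \mathbf{Y}_t\mathbf{Y}_t'$, noticing that $S_T = \boldsymbol{\lambda}'\mathbf{S}_T\boldsymbol{\lambda}$. Since $S_T$ is a continuous function of $\mathbf{S}_T$, we know that plim $S_T = \boldsymbol{\lambda}'\boldsymbol{\Omega}\boldsymbol{\lambda} \equiv \sigma^2$, where $\boldsymbol{\Omega}$ is given as the plim of $\mathbf{S}_T$. Thus $Y_t$ satisfies conditions (a) through (c) of Proposition 7.8, and so $\sqrt{T}\,\bar{Y}_T \xrightarrow{L} N(0, \sigma^2)$, or $\sqrt{T}\,\bar{Y}_T \xrightarrow{L} \boldsymbol{\lambda}'\mathbf{Y}$, where $\mathbf{Y} \sim N(\mathbf{0}, \boldsymbol{\Omega})$. Since this is true for any $\boldsymbol{\lambda}$, this confirms the claim that $\sqrt{T}\,\bar{\mathbf{Y}}_T \xrightarrow{L} N(\mathbf{0}, \boldsymbol{\Omega})$. ■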


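■ Proof of Proposition 7.10. Let $Y \equiv X_t X_s$ and $W \equiv X_u X_v$. Then Hölder's inequality implies that for $r > 1$,

$$E|X_t X_s X_u X_v| \le \{E|X_t X_s|^r\}^{1/r} \times \{E|X_u X_v|^{r/(r-1)}\}^{(r-1)/r}.$$

For $r = 2$, this means

$$E|X_t X_s X_u X_v| \le \{E(X_t X_s)^2\}^{1/2} \times \{E(X_u X_v)^2\}^{1/2} \le \max\{E(X_t X_s)^2, E(X_u X_v)^2\}.$$

A second application of Hölder's inequality with $Y \equiv X_t^2$ and $W \equiv X_s^2$ reveals that

$$E(X_t X_s)^2 = E(X_t^2 X_s^2) \le \{E(X_t^2)^r\}^{1/r} \times \{E(X_s^2)^{r/(r-1)}\}^{(r-1)/r}.$$

Again for $r = 2$, this implies from the strict stationarity of $\{X_t\}$ that

$$E(X_t X_s)^2 \le E(X_t^4).$$

Hence, if $\{X_t\}$ is strictly stationary with finite fourth moment, then

$$E|X_t X_s X_u X_v| \le E(X_t^4) = \mu_4$$

for all $t$, $s$, $u$, and $v$.

Observe further that

$$\begin{aligned}
E|Y_t Y_s Y_u Y_v| &= E\left|\sum_{i=0}^{\infty} h_i X_{t-i} \sum_{j=0}^{\infty} h_j X_{s-j} \sum_{l=0}^{\infty} h_l X_{u-l} \sum_{m=0}^{\infty} h_m X_{v-m}\right| \\
&= E\left|\sum_{i=0}^{\infty}\sum_{j=0}^{\infty}\sum_{l=0}^{\infty}\sum_{m=0}^{\infty} h_i h_j h_l h_m X_{t-i} X_{s-j} X_{u-l} X_{v-m}\right| \\
&\le E\left\{\sum_{i=0}^{\infty}\sum_{j=0}^{\infty}\sum_{l=0}^{\infty}\sum_{m=0}^{\infty} |h_i h_j h_l h_m| \cdot |X_{t-i} X_{s-j} X_{u-l} X_{v-m}|\right\}.
\end{aligned}$$

But

$$\sum_{i=0}^{\infty}\sum_{j=0}^{\infty}\sum_{l=0}^{\infty}\sum_{m=0}^{\infty} |h_i h_j h_l h_m| = \sum_{i=0}^{\infty} |h_i| \sum_{j=0}^{\infty} |h_j| \sum_{l=0}^{\infty} |h_l| \sum_{m=0}^{\infty} |h_m| < \infty$$

and

$$E|X_{t-i} X_{s-j} X_{u-l} X_{v-m}| \le \mu_4$$

for any value of any of the indices. Hence,

$$E|Y_t Y_s Y_u Y_v| \le \sum_{i=0}^{\infty}\sum_{j=0}^{\infty}\sum_{l=0}^{\infty}\sum_{m=0}^{\infty} |h_i h_j h_l h_m| \cdot \mu_4 < \infty. \quad ■$$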


Chapter 7 Exercises 
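7.1. Let $\{X_T\}$ denote a sequence of random scalars with plim $X_T = \xi$. Let $\{c_T\}$ denote a sequence of deterministic scalars with $\lim_{T\to\infty} c_T = c$. Let $g: \mathbb{R}^2 \to \mathbb{R}^1$ be continuous at $(\xi, c)$. Show that $g(X_T, c_T) \xrightarrow{p} g(\xi, c)$.

7.2. Let $Y_t = 0.8Y_{t-1} + \varepsilon_t$ with $E(\varepsilon_t \varepsilon_\tau) = 1$ for $t = \tau$ and zero otherwise.
(a) Calculate $\lim_{T\to\infty} T \cdot \text{Var}(\bar{Y}_T)$.
(b) How large a sample would we need in order to have 95% confidence that $\bar{Y}_T$ differed from the true value zero by no more than 0.1?

7.3. Does a martingale difference sequence have to be covariance-stationary?

7.4. Let $Y_t = \sum_{j=0}^{\infty} \psi_j \varepsilon_{t-j}$, where $\sum_{j=0}^{\infty} |\psi_j| < \infty$ and $\{\varepsilon_t\}$ is a martingale difference sequence with $E(\varepsilon_t^2) = \sigma^2$. Is $Y_t$ covariance-stationary?

7.5. Define $X_{t,k} \equiv \sum_{u=0}^{\infty}\sum_{v=0}^{\infty} \psi_u \psi_v [\varepsilon_{t-u}\varepsilon_{t-k-v} - E(\varepsilon_{t-u}\varepsilon_{t-k-v})]$, where $\varepsilon_t$ is an i.i.d. sequence with $E|\varepsilon_t|^r < M''$ for some $r > 2$ and $M'' < \infty$ with $\sum_{j=0}^{\infty} |\psi_j| < \infty$. Show that $X_{t,k}$ is uniformly integrable.

7.6. Derive result [7.2.15].

7.7. Let $Y_t$ follow an ARMA($p$, $q$) process,

$$(1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p)(Y_t - \mu) = (1 + \theta_1 L + \theta_2 L^2 + \cdots + \theta_q L^q)\varepsilon_t,$$

with roots of $(1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_p z^p) = 0$ and $(1 + \theta_1 z + \theta_2 z^2 + \cdots + \theta_q z^q) = 0$ outside the unit circle. Suppose $\varepsilon_t$ has mean zero and is independent of $\varepsilon_\tau$ for $t \ne \tau$ with $E(\varepsilon_t^2) = \sigma^2$ and $E(\varepsilon_t^4) < \infty$ for all $t$. Prove the following:

(a) $(1/T)\sum_{t=1}^{T} Y_t \xrightarrow{p} \mu$

(b) $[1/(T - k)]\sum_{t=k+1}^{T} Y_t Y_{t-k} \xrightarrow{p} E(Y_t Y_{t-k})$.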


Chapter 7 References 


Anderson, T. W. 1971. The Statistical Analysis of Time Series. New York: Wiley.

Andrews, Donald W. K. 1988. "Laws of Large Numbers for Dependent Non-Identically Distributed Random Variables." Econometric Theory 4:458-67.

Hoel, Paul G., Sidney C. Port, and Charles J. Stone. 1971. Introduction to Probability Theory. Boston: Houghton Mifflin.

Marsden, Jerrold E. 1974. Elementary Classical Analysis. San Francisco: Freeman.

Phillips, Peter C. B., and Victor Solo. 1992. "Asymptotics for Linear Processes." Annals of Statistics 20:971-1001.

Rao, C. Radhakrishna. 1973. Linear Statistical Inference and Its Applications, 2d ed. New York: Wiley.

Royden, H. L. 1968. Real Analysis, 2d ed. New York: Macmillan.

Theil, Henri. 1971. Principles of Econometrics. New York: Wiley.

White, Halbert. 1984. Asymptotic Theory for Econometricians. Orlando, Fla.: Academic Press.




Linear Regression Models 


We have seen that one convenient way to estimate the parameters of an auto- 
regression is with ordinary least squares regression, an estimation technique that 
is also useful for a number of other models. This chapter reviews the properties 
of linear regression. Section 8.1 analyzes the simplest case, in which the explanatory 
variables are nonrandom and the disturbances are i.i.d. Gaussian. Section 8.2 
develops analogous results for ordinary least squares estimation of more general 
models such as autoregressions and regressions in which the disturbances are non- 
Gaussian, heteroskedastic, or autocorrelated. Linear regression models can also 
be estimated by generalized least squares, which is described in Section 8.3. 


8.1. Review of Ordinary Least Squares with Deterministic 
Regressors and i.i.d. Gaussian Disturbances 


Suppose that a scalar $y_t$ is related to a $(k \times 1)$ vector $\mathbf{x}_t$ and a disturbance term $u_t$ according to the regression model

$$y_t = \mathbf{x}_t'\boldsymbol{\beta} + u_t. \qquad [8.1.1]$$

This relation could be used to describe either the random variables or their realization. In discussing regression models, it proves cumbersome to distinguish notationally between random variables and their realization, and standard practice is to use small letters for either.

This section reviews estimation and hypothesis tests about $\boldsymbol{\beta}$ under the assumptions that $\mathbf{x}_t$ is deterministic and $u_t$ is i.i.d. Gaussian. The next sections discuss regression under more general assumptions. First, however, we summarize the mechanics of linear regression and present some formulas that hold regardless of statistical assumptions.


The Algebra of Linear Regression 


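Given an observed sample $(y_1, y_2, \ldots, y_T)$, the ordinary least squares (OLS) estimate of $\boldsymbol{\beta}$ (denoted $\mathbf{b}$) is the value of $\boldsymbol{\beta}$ that minimizes the residual sum of squares (RSS):

$$\text{RSS} = \sum_{t=1}^{T} (y_t - \mathbf{x}_t'\boldsymbol{\beta})^2. \qquad [8.1.2]$$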




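We saw in Appendix 4.A to Chapter 4 that the OLS estimate is given by

$$\mathbf{b} = \left[\sum_{t=1}^{T} (\mathbf{x}_t\mathbf{x}_t')\right]^{-1}\left[\sum_{t=1}^{T} (\mathbf{x}_t y_t)\right], \qquad [8.1.3]$$

assuming that the $(k \times k)$ matrix $\sum_{t=1}^{T} (\mathbf{x}_t\mathbf{x}_t')$ is nonsingular. The OLS sample residual for observation $t$ is

$$\hat{u}_t \equiv y_t - \mathbf{x}_t'\mathbf{b}. \qquad [8.1.4]$$

Often the model in [8.1.1] is written in matrix notation as

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}, \qquad [8.1.5]$$

where

$$\underset{(T \times 1)}{\mathbf{y}} \equiv \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_T \end{bmatrix} \qquad \underset{(T \times k)}{\mathbf{X}} \equiv \begin{bmatrix} \mathbf{x}_1' \\ \mathbf{x}_2' \\ \vdots \\ \mathbf{x}_T' \end{bmatrix} \qquad \underset{(T \times 1)}{\mathbf{u}} \equiv \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_T \end{bmatrix}.$$

Then the OLS estimate in [8.1.3] can be written as

$$\mathbf{b} = \left\{ \begin{bmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_T \end{bmatrix} \begin{bmatrix} \mathbf{x}_1' \\ \mathbf{x}_2' \\ \vdots \\ \mathbf{x}_T' \end{bmatrix} \right\}^{-1} \left\{ \begin{bmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_T \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_T \end{bmatrix} \right\} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}. \qquad [8.1.6]$$

Similarly, the vector of OLS sample residuals [8.1.4] can be written as

$$\hat{\mathbf{u}} = \mathbf{y} - \mathbf{X}\mathbf{b} = \mathbf{y} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = [\mathbf{I}_T - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\mathbf{y} = \mathbf{M}_X\mathbf{y}, \qquad [8.1.7]$$

where $\mathbf{M}_X$ is defined as the following $(T \times T)$ matrix:

$$\mathbf{M}_X \equiv \mathbf{I}_T - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'. \qquad [8.1.8]$$

One can readily verify that $\mathbf{M}_X$ is symmetric:

$$\mathbf{M}_X = \mathbf{M}_X';$$

idempotent:

$$\mathbf{M}_X\mathbf{M}_X = \mathbf{M}_X;$$

and orthogonal to the columns of $\mathbf{X}$:

$$\mathbf{M}_X\mathbf{X} = \mathbf{0}. \qquad [8.1.9]$$

Thus, from [8.1.7], the OLS sample residuals are orthogonal to the explanatory variables in $\mathbf{X}$:

$$\hat{\mathbf{u}}'\mathbf{X} = \mathbf{y}'\mathbf{M}_X'\mathbf{X} = \mathbf{0}'. \qquad [8.1.10]$$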

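The OLS sample residual ($\hat{u}_t$) should be distinguished from the population residual ($u_t$). The sample residual is constructed from the sample estimate $\mathbf{b}$ ($\hat{u}_t = y_t - \mathbf{x}_t'\mathbf{b}$), whereas the population residual is a hypothetical construct based on the true population value $\boldsymbol{\beta}$ ($u_t = y_t - \mathbf{x}_t'\boldsymbol{\beta}$). The relation between the sample and population residuals can be found by substituting [8.1.5] into [8.1.7]:

$$\hat{\mathbf{u}} = \mathbf{M}_X(\mathbf{X}\boldsymbol{\beta} + \mathbf{u}) = \mathbf{M}_X\mathbf{u}. \qquad [8.1.11]$$

The difference between the OLS estimate $\mathbf{b}$ and the true population parameter $\boldsymbol{\beta}$ is found by substituting [8.1.5] into [8.1.6]:

$$\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'[\mathbf{X}\boldsymbol{\beta} + \mathbf{u}] = \boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}. \qquad [8.1.12]$$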
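The algebraic identities [8.1.6] through [8.1.12] hold for any data set, which makes them easy to check numerically. The following is a minimal sketch on simulated data; the sample size, the coefficient values, and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, k = 50, 3
X = np.column_stack([np.ones(T), rng.standard_normal((T, k - 1))])
beta = np.array([1.0, 2.0, -0.5])          # assumed "true" coefficients for the simulation
u = rng.standard_normal(T)                 # population residuals
y = X @ beta + u                           # model [8.1.5]

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                      # OLS estimate [8.1.6]
M = np.eye(T) - X @ XtX_inv @ X.T          # M_X as defined in [8.1.8]
u_hat = y - X @ b                          # sample residuals [8.1.4]

print("M symmetric:        ", np.allclose(M, M.T))
print("M idempotent:       ", np.allclose(M @ M, M))
print("M X = 0:            ", np.allclose(M @ X, 0.0))
print("u_hat' X = 0:       ", np.allclose(u_hat @ X, 0.0))
print("u_hat = M u:        ", np.allclose(u_hat, M @ u))
print("b - beta = (X'X)^-1 X'u:", np.allclose(b - beta, XtX_inv @ X.T @ u))
```

In applied work one would usually avoid forming $(\mathbf{X}'\mathbf{X})^{-1}$ or the $(T \times T)$ matrix $\mathbf{M}_X$ explicitly and call a least squares solver instead; the explicit forms are used here only to mirror the formulas.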


The fit of an OLS regression is sometimes described in terms of the sample multiple correlation coefficient, or $R^2$. The uncentered $R^2$ (denoted $R_u^2$) is defined as the sum of squares of the fitted values ($\mathbf{x}_t'\mathbf{b}$) of the regression as a fraction of the sum of squares of $y$:

$$R_u^2 = \frac{\sum_{t=1}^{T} (\mathbf{b}'\mathbf{x}_t\mathbf{x}_t'\mathbf{b})}{\sum_{t=1}^{T} y_t^2} = \frac{\mathbf{b}'\mathbf{X}'\mathbf{X}\mathbf{b}}{\mathbf{y}'\mathbf{y}} = \frac{\mathbf{y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}}{\mathbf{y}'\mathbf{y}}. \qquad [8.1.13]$$

If the only explanatory variable in the regression were a constant term ($\mathbf{x}_t = 1$), then the fitted value for each observation would just be the sample mean $\bar{y}$ and the sum of squares of the fitted values would be $T\bar{y}^2$. This sum of squares is often compared with the sum of squares when a vector of variables $\mathbf{x}_t$ is included in the regression. The centered $R^2$ (denoted $R_c^2$) is defined as

$$R_c^2 = \frac{\mathbf{y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} - T\bar{y}^2}{\mathbf{y}'\mathbf{y} - T\bar{y}^2}. \qquad [8.1.14]$$

Most regression software packages report the centered $R^2$ rather than the uncentered $R^2$. If the regression includes a constant term, then $R_c^2$ must be between zero and unity. However, if the regression does not include a constant term, then $R_c^2$ can be negative.
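The two definitions [8.1.13] and [8.1.14] can be compared on simulated data. In the sketch below the data-generating process (a mean of 10 and a slope of 0.5) and the function name `r_squared` are assumptions chosen so that omitting the constant term produces a negative centered $R^2$.

```python
import numpy as np

def r_squared(y, X):
    """Return (uncentered R^2, centered R^2) following [8.1.13] and [8.1.14]."""
    b = np.linalg.solve(X.T @ X, X.T @ y)
    fitted_ss = b @ X.T @ X @ b              # b'X'Xb, the sum of squared fitted values
    T = len(y)
    ybar = y.mean()
    r2_u = fitted_ss / (y @ y)
    r2_c = (fitted_ss - T * ybar**2) / (y @ y - T * ybar**2)
    return r2_u, r2_c

rng = np.random.default_rng(0)
T = 200
x = rng.standard_normal(T)
y = 10.0 + 0.5 * x + rng.standard_normal(T)

with_const = r_squared(y, np.column_stack([np.ones(T), x]))
without_const = r_squared(y, x[:, None])
print("with constant    (R2_u, R2_c):", with_const)
print("without constant (R2_u, R2_c):", without_const)
```

Without the constant, the fitted values cannot reproduce the large sample mean of $y$, so the numerator of [8.1.14] falls below zero even though the uncentered measure stays between zero and one.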


The Classical Regression Assumptions 


Statistical inference requires assumptions about the properties of the explan- 
atory variables x, and the population residuals u,. The simplest case to analyze is 
the following. 


Assumption 8.1: (a) $\mathbf{x}_t$ is a vector of deterministic variables (for example, $\mathbf{x}_t$ might include a constant term and deterministic functions of $t$); (b) $u_t$ is i.i.d. with mean 0 and variance $\sigma^2$; (c) $u_t$ is Gaussian.


To highlight the role of each of these assumptions, we first note the impli- 
cations of Assumption 8.1(a) and (b) alone and then comment on the added im- 
plications that follow from (c). 


Properties of the Estimated OLS Coefficient Vector 
Under Assumption 8.1(a) and (b) 


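In vector form, Assumption 8.1(b) could be written $E(\mathbf{u}) = \mathbf{0}$ and $E(\mathbf{u}\mathbf{u}') = \sigma^2\mathbf{I}_T$.

Taking expectations of [8.1.12] and using these conditions establishes that $\mathbf{b}$ is unbiased,

$$E(\mathbf{b}) = \boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'[E(\mathbf{u})] = \boldsymbol{\beta}, \qquad [8.1.15]$$

with variance-covariance matrix given by

$$\begin{aligned}
E[(\mathbf{b} - \boldsymbol{\beta})(\mathbf{b} - \boldsymbol{\beta})'] &= E[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}\mathbf{u}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}] \\
&= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'[E(\mathbf{u}\mathbf{u}')]\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} \qquad [8.1.16] \\
&= \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} \\
&= \sigma^2(\mathbf{X}'\mathbf{X})^{-1}.
\end{aligned}$$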

The OLS coefficient estimate $\mathbf{b}$ is unbiased and is a linear function of $\mathbf{y}$. The Gauss-Markov theorem states that the variance-covariance matrix of any alternative estimator of $\boldsymbol{\beta}$, if that estimator is also unbiased and a linear function of $\mathbf{y}$, differs from the variance-covariance matrix of $\mathbf{b}$ by a positive semidefinite matrix.¹ This means that an inference based on $\mathbf{b}$ about any linear combination of the elements of $\boldsymbol{\beta}$ will have a smaller variance than the corresponding inference based on any alternative linear unbiased estimator. The Gauss-Markov theorem thus establishes the optimality of the OLS estimate within a certain limited class.
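Results [8.1.15] and [8.1.16] require only Assumption 8.1(a) and (b), so they can be illustrated with deliberately non-Gaussian disturbances. The sketch below is a Monte Carlo check under assumed settings: a constant and a linear trend as deterministic regressors, uniform disturbances scaled to variance $\sigma^2 = 1$, and 20,000 replications.

```python
import numpy as np

rng = np.random.default_rng(0)
T, sigma2, reps = 30, 1.0, 20_000
trend = np.arange(1, T + 1) / T
X = np.column_stack([np.ones(T), trend])       # deterministic regressors, held fixed
beta = np.array([2.0, -1.0])                   # assumed true coefficients
XtX_inv = np.linalg.inv(X.T @ X)

# i.i.d. uniform disturbances with mean 0 and variance sigma2 (not Gaussian).
half_width = np.sqrt(3.0 * sigma2)
U = rng.uniform(-half_width, half_width, size=(reps, T))

# Row r of B holds b = beta + (X'X)^{-1} X'u for replication r, as in [8.1.12].
B = beta + U @ X @ XtX_inv

mean_b = B.mean(axis=0)
cov_b = np.cov(B, rowvar=False)
theory_cov = sigma2 * XtX_inv                  # [8.1.16]
print("average b:      ", mean_b)
print("simulated cov:\n", cov_b)
print("sigma^2 (X'X)^-1:\n", theory_cov)
```

The average of the simulated estimates should be close to the assumed coefficients, and the simulated covariance matrix close to $\sigma^2(\mathbf{X}'\mathbf{X})^{-1}$, even though the disturbances are uniform rather than Normal.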


Properties of the Estimated Coefficient Vector 
Under Assumption 8.1(a) Through (c) 


When $\mathbf{u}$ is Gaussian, [8.1.12] implies that $\mathbf{b}$ is Gaussian. Hence, the preceding results imply

$$\mathbf{b} \sim N(\boldsymbol{\beta}, \sigma^2(\mathbf{X}'\mathbf{X})^{-1}). \qquad [8.1.17]$$

It can further be shown that under Assumption 8.1(a) through (c), no unbiased estimator of $\boldsymbol{\beta}$ is more efficient than the OLS estimator $\mathbf{b}$.² Thus, with Gaussian residuals, the OLS estimator is optimal.


Properties of Estimated Residual Variance 
Under Assumption 8.1(a) and (b) 


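The OLS estimate of the variance of the disturbances $\sigma^2$ is

$$s^2 = \text{RSS}/(T - k) = \hat{\mathbf{u}}'\hat{\mathbf{u}}/(T - k) = \mathbf{u}'\mathbf{M}_X'\mathbf{M}_X\mathbf{u}/(T - k) \qquad [8.1.18]$$

for $\mathbf{M}_X$ the matrix in [8.1.8]. Recalling that $\mathbf{M}_X$ is symmetric and idempotent, [8.1.18] becomes

$$s^2 = \mathbf{u}'\mathbf{M}_X\mathbf{u}/(T - k). \qquad [8.1.19]$$

Also, since $\mathbf{M}_X$ is symmetric, there exists a $(T \times T)$ matrix $\mathbf{P}$ such that³

$$\mathbf{M}_X = \mathbf{P}\boldsymbol{\Lambda}\mathbf{P}' \qquad [8.1.20]$$

and

$$\mathbf{P}'\mathbf{P} = \mathbf{I}_T, \qquad [8.1.21]$$

where $\boldsymbol{\Lambda}$ is a $(T \times T)$ matrix with the eigenvalues of $\mathbf{M}_X$ along the principal diagonal and zeros elsewhere. Note from [8.1.9] that $\mathbf{M}_X\mathbf{v} = \mathbf{0}$ if $\mathbf{v}$ should be given by one of the $k$ columns of $\mathbf{X}$. Assuming that the columns of $\mathbf{X}$ are linearly independent, the $k$ columns of $\mathbf{X}$ thus represent $k$ different eigenvectors of $\mathbf{M}_X$ each associated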

¹See, for example, Theil (1971, pp. 119-20).
²See, for example, Theil (1971, pp. 390-91).
³See, for example, O'Nan (1976, p. 296).




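with an eigenvalue equal to zero. Also from [8.1.8], $\mathbf{M}_X\mathbf{v} = \mathbf{v}$ for any vector $\mathbf{v}$ that is orthogonal to the columns of $\mathbf{X}$ (that is, any vector $\mathbf{v}$ such that $\mathbf{X}'\mathbf{v} = \mathbf{0}$); $(T - k)$ such vectors that are linearly independent can be found, associated with $(T - k)$ eigenvalues equal to unity. Thus, $\boldsymbol{\Lambda}$ contains $k$ zeros and $(T - k)$ 1s along its principal diagonal. Notice from [8.1.20] that

$$\begin{aligned}
\mathbf{u}'\mathbf{M}_X\mathbf{u} &= \mathbf{u}'\mathbf{P}\boldsymbol{\Lambda}\mathbf{P}'\mathbf{u} \\
&= (\mathbf{P}'\mathbf{u})'\boldsymbol{\Lambda}(\mathbf{P}'\mathbf{u}) \qquad [8.1.22] \\
&= \mathbf{w}'\boldsymbol{\Lambda}\mathbf{w} \\
&= w_1^2\lambda_1 + w_2^2\lambda_2 + \cdots + w_T^2\lambda_T,
\end{aligned}$$

where

$$\mathbf{w} \equiv \mathbf{P}'\mathbf{u}.$$

Furthermore,

$$E(\mathbf{w}\mathbf{w}') = E(\mathbf{P}'\mathbf{u}\mathbf{u}'\mathbf{P}) = \mathbf{P}'E(\mathbf{u}\mathbf{u}')\mathbf{P} = \sigma^2\mathbf{P}'\mathbf{P} = \sigma^2\mathbf{I}_T.$$

Thus, the elements of $\mathbf{w}$ are uncorrelated, with mean zero and variance $\sigma^2$. Since $k$ of the $\lambda$'s are zero and the remaining $T - k$ are unity, [8.1.22] becomes

$$\mathbf{u}'\mathbf{M}_X\mathbf{u} = w_1^2 + w_2^2 + \cdots + w_{T-k}^2. \qquad [8.1.23]$$

Furthermore, each $w_t^2$ has expectation $\sigma^2$, so that

$$E(\mathbf{u}'\mathbf{M}_X\mathbf{u}) = (T - k)\sigma^2,$$

and from [8.1.19], $s^2$ gives an unbiased estimate of $\sigma^2$:

$$E(s^2) = \sigma^2.$$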
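Both steps of this argument can be checked numerically: the eigenvalues of $\mathbf{M}_X$ should consist of $k$ zeros and $T - k$ ones, and the average of $s^2$ over many samples should be close to $\sigma^2$. The sketch below uses assumed values ($T = 25$, $k = 3$, $\sigma^2 = 4$, polynomial-trend regressors) and, for convenience only, Normal disturbances.

```python
import numpy as np

rng = np.random.default_rng(0)
T, k, sigma2, reps = 25, 3, 4.0, 20_000
t = np.arange(1, T + 1) / T
X = np.column_stack([np.ones(T), t, t**2])        # deterministic regressors
M = np.eye(T) - X @ np.linalg.inv(X.T @ X) @ X.T  # M_X from [8.1.8]

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
eigenvalues = np.linalg.eigvalsh(M)
print("smallest k eigenvalues:", np.round(eigenvalues[:k], 10))
print("remaining eigenvalues all one:", np.allclose(eigenvalues[k:], 1.0))

# s^2 = u'M_X u / (T - k) for each replication, as in [8.1.19].
U = rng.normal(0.0, np.sqrt(sigma2), size=(reps, T))
s2 = ((U @ M) * U).sum(axis=1) / (T - k)
print("average s^2:", s2.mean(), " sigma^2:", sigma2)
```

Dividing by $T$ instead of $T - k$ would bias the estimate downward by the factor $(T - k)/T$, which is 0.88 in this design.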


Properties of Estimated Residual Variance Under 
Assumption 8.1(a) Through (c) 


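When $u_t$ is Gaussian, $w_t$ is also Gaussian and expression [8.1.23] is the sum of squares of $(T - k)$ independent $N(0, \sigma^2)$ variables. Thus,

$$\text{RSS}/\sigma^2 = \mathbf{u}'\mathbf{M}_X\mathbf{u}/\sigma^2 \sim \chi^2(T - k). \qquad [8.1.24]$$

Again, it is possible to show that under Assumption 8.1(a) through (c), no other unbiased estimator of $\sigma^2$ has a smaller variance than does $s^2$.⁴

Notice also from [8.1.11] and [8.1.12] that $\mathbf{b}$ and $\hat{\mathbf{u}}$ are uncorrelated:

$$E[\hat{\mathbf{u}}(\mathbf{b} - \boldsymbol{\beta})'] = E[\mathbf{M}_X\mathbf{u}\mathbf{u}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}] = \sigma^2\mathbf{M}_X\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} = \mathbf{0}. \qquad [8.1.25]$$

Under Assumption 8.1(a) through (c), both $\mathbf{b}$ and $\hat{\mathbf{u}}$ are Gaussian, so that absence of correlation implies that $\mathbf{b}$ and $\hat{\mathbf{u}}$ are independent. This means that $\mathbf{b}$ and $s^2$ are independent.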


t Tests About B Under Assumption 8.1(a) Through (c) 


Suppose that we wish to test the null hypothesis that $\beta_i$, the $i$th element of $\boldsymbol{\beta}$, is equal to some particular value $\beta_i^0$. The OLS $t$ statistic for testing this null hypothesis is given by

$$t = \frac{(b_i - \beta_i^0)}{\hat{\sigma}_{b_i}} = \frac{(b_i - \beta_i^0)}{s(\xi^{ii})^{1/2}}, \qquad [8.1.26]$$


⁴See Rao (1973, p. 319).




where $\xi^{ii}$ denotes the row $i$, column $i$ element of $(\mathbf{X}'\mathbf{X})^{-1}$ and $\hat{\sigma}_{b_i} \equiv \sqrt{s^2\xi^{ii}}$ is the standard error of the OLS estimate of the $i$th coefficient. The magnitude in [8.1.26] has an exact $t$ distribution with $T - k$ degrees of freedom so long as $\mathbf{x}_t$ is deterministic and $u_t$ is i.i.d. Gaussian. To verify this claim, note from [8.1.17] that under the null hypothesis, $b_i \sim N(\beta_i^0, \sigma^2\xi^{ii})$, meaning that $(b_i - \beta_i^0)/\sqrt{\sigma^2\xi^{ii}} \sim N(0, 1)$. Thus, if [8.1.26] is written as

$$t = \frac{(b_i - \beta_i^0)/\sqrt{\sigma^2\xi^{ii}}}{\sqrt{s^2/\sigma^2}},$$

the numerator is $N(0, 1)$ while from [8.1.24] the denominator is the square root of a $\chi^2(T - k)$ variable divided by its degrees of freedom. Recalling [8.1.25], the numerator and denominator are independent, confirming the exact $t$ distribution claimed for [8.1.26].
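One way to see the exact $t(T - k)$ result at work is to simulate the statistic [8.1.26] under a true null hypothesis and compare its variance with $\nu/(\nu - 2)$, the variance of a $t$ variable with $\nu = T - k$ degrees of freedom. The sketch below uses assumed settings ($T = 23$, $k = 3$, so $\nu = 20$, with Gaussian disturbances and fixed regressors); it is an illustration, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)
T, k, reps, i = 23, 3, 40_000, 1
nu = T - k                                        # degrees of freedom
t_idx = np.arange(1, T + 1) / T
X = np.column_stack([np.ones(T), t_idx, np.cos(2 * np.pi * t_idx)])  # deterministic
beta = np.array([1.0, 0.5, -2.0])                 # assumed true coefficients
XtX_inv = np.linalg.inv(X.T @ X)
M = np.eye(T) - X @ XtX_inv @ X.T

U = rng.standard_normal((reps, T))                # i.i.d. N(0, 1) disturbances
B = beta + U @ X @ XtX_inv                        # OLS estimates, one row per sample
resid = U @ M                                     # sample residuals u_hat = M_X u
s2 = (resid**2).sum(axis=1) / nu                  # s^2 as in [8.1.18]
t_stats = (B[:, i] - beta[i]) / np.sqrt(s2 * XtX_inv[i, i])   # [8.1.26] under H0

print("simulated variance of t:", t_stats.var())
print("nu / (nu - 2):          ", nu / (nu - 2))
```

The simulated variance should be noticeably above 1 (the variance of a standard Normal) and close to $20/18$, reflecting the extra dispersion from estimating $\sigma^2$.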


F Tests About B Under Assumption 8.1(a) Through (c) 


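More generally, suppose we want a joint test of $m$ different linear restrictions about $\boldsymbol{\beta}$, as represented by

$$H_0: \mathbf{R}\boldsymbol{\beta} = \mathbf{r}. \qquad [8.1.27]$$

Here $\mathbf{R}$ is a known $(m \times k)$ matrix representing the particular linear combinations of $\boldsymbol{\beta}$ about which we entertain hypotheses and $\mathbf{r}$ is a known $(m \times 1)$ vector of the values that we believe these linear combinations take on. For example, to represent the simple hypothesis $\beta_i = \beta_i^0$ used previously, we would have $m = 1$, $\mathbf{R}$ a $(1 \times k)$ vector with unity in the $i$th position and zeros elsewhere, and $\mathbf{r}$ the scalar $\beta_i^0$. As a second example, consider a regression with $k = 4$ explanatory variables and the joint hypothesis that $\beta_1 + \beta_2 = 1$ and $\beta_3 = \beta_4$. In this case, $m = 2$ and

$$\mathbf{R} = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & -1 \end{bmatrix} \qquad \mathbf{r} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}. \qquad [8.1.28]$$

Notice from [8.1.17] that under $H_0$,

$$\mathbf{R}\mathbf{b} \sim N(\mathbf{r}, \sigma^2\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'). \qquad [8.1.29]$$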


A Wald test of Hy is based on the following result. 


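Proposition 8.1: Consider an $(n \times 1)$ vector $\mathbf{z} \sim N(\mathbf{0}, \boldsymbol{\Omega})$ with $\boldsymbol{\Omega}$ nonsingular. Then $\mathbf{z}'\boldsymbol{\Omega}^{-1}\mathbf{z} \sim \chi^2(n)$.


For the scalar case ($n = 1$), observe that if $z \sim N(0, \sigma^2)$, then $(z/\sigma) \sim N(0, 1)$ and $z^2/\sigma^2 \sim \chi^2(1)$, as asserted by the proposition.

To verify Proposition 8.1 for the vector case, since $\boldsymbol{\Omega}$ is symmetric, there exists a matrix $\mathbf{P}$, as in [8.1.20] and [8.1.21], such that $\boldsymbol{\Omega} = \mathbf{P}\boldsymbol{\Lambda}\mathbf{P}'$ and $\mathbf{P}'\mathbf{P} = \mathbf{I}_n$ with $\boldsymbol{\Lambda}$ containing the eigenvalues of $\boldsymbol{\Omega}$. Since $\boldsymbol{\Omega}$ is positive definite, the diagonal elements of $\boldsymbol{\Lambda}$ are positive. Then

$$\begin{aligned}
\mathbf{z}'\boldsymbol{\Omega}^{-1}\mathbf{z} &= \mathbf{z}'(\mathbf{P}\boldsymbol{\Lambda}\mathbf{P}')^{-1}\mathbf{z} \\
&= \mathbf{z}'[\mathbf{P}']^{-1}\boldsymbol{\Lambda}^{-1}\mathbf{P}^{-1}\mathbf{z} \\
&= [\mathbf{P}^{-1}\mathbf{z}]'\boldsymbol{\Lambda}^{-1}\mathbf{P}^{-1}\mathbf{z} \qquad [8.1.30] \\
&= \mathbf{w}'\boldsymbol{\Lambda}^{-1}\mathbf{w} \\
&= \sum_{i=1}^{n} w_i^2/\lambda_i,
\end{aligned}$$

where $\mathbf{w} \equiv \mathbf{P}^{-1}\mathbf{z}$. Notice that $\mathbf{w}$ is Gaussian with mean zero and variance

$$E(\mathbf{w}\mathbf{w}') = E(\mathbf{P}^{-1}\mathbf{z}\mathbf{z}'[\mathbf{P}']^{-1}) = \mathbf{P}^{-1}\boldsymbol{\Omega}[\mathbf{P}']^{-1} = \mathbf{P}^{-1}\mathbf{P}\boldsymbol{\Lambda}\mathbf{P}'[\mathbf{P}']^{-1} = \boldsymbol{\Lambda}.$$

Thus [8.1.30] is the sum of squares of $n$ independent Normal variables, each divided by its variance $\lambda_i$. It accordingly has a $\chi^2(n)$ distribution, as claimed.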
Applying Proposition 8.1 directly to [8.1.29], under $H_0$,

$$(\mathbf{R}\mathbf{b} - \mathbf{r})'[\sigma^2\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1}(\mathbf{R}\mathbf{b} - \mathbf{r}) \sim \chi^2(m). \qquad [8.1.31]$$

Replacing $\sigma^2$ with the estimate $s^2$ and dividing by the number of restrictions gives the Wald form of the OLS $F$ test of a linear hypothesis:

$$F = (\mathbf{R}\mathbf{b} - \mathbf{r})'[s^2\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1}(\mathbf{R}\mathbf{b} - \mathbf{r})/m. \qquad [8.1.32]$$

Note that [8.1.32] can be written

$$F = \frac{(\mathbf{R}\mathbf{b} - \mathbf{r})'[\sigma^2\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1}(\mathbf{R}\mathbf{b} - \mathbf{r})/m}{[\text{RSS}/(T - k)]/\sigma^2}.$$

The numerator is a $\chi^2(m)$ variable divided by its degrees of freedom, while the denominator is a $\chi^2(T - k)$ variable divided by its degrees of freedom. Again, since $\mathbf{b}$ and $\hat{\mathbf{u}}$ are independent, the numerator and denominator are independent of each other. Hence, [8.1.32] has an exact $F(m, T - k)$ distribution under $H_0$ when $\mathbf{x}_t$ is nonstochastic and $u_t$ is i.i.d. Gaussian.

Notice that the $t$ test of the simple hypothesis $\beta_i = \beta_i^0$ is a special case of the general formula [8.1.32], for which

$$F = (b_i - \beta_i^0)[s^2\xi^{ii}]^{-1}(b_i - \beta_i^0). \qquad [8.1.33]$$

This is the square of the $t$ statistic in [8.1.26]. Since an $F(1, T - k)$ variable is just the square of a $t(T - k)$ variable, the identical answer results from (1) calculating [8.1.26] and using $t$ tables to find the probability of so large an absolute value for a $t(T - k)$ variable, or (2) calculating [8.1.33] and using $F$ tables to find the probability of so large a value for an $F(1, T - k)$ variable.
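The Wald form [8.1.32] translates almost line for line into matrix code. The sketch below is one possible implementation on simulated data; the function name `wald_F`, the sample size, and the coefficient values are illustrative assumptions. It evaluates the joint hypothesis in [8.1.28] and also confirms that, for a single restriction, the statistic equals the square of the $t$ statistic.

```python
import numpy as np

def wald_F(y, X, R, r):
    """Wald form of the OLS F statistic [8.1.32] for H0: R beta = r."""
    T, k = X.shape
    m = R.shape[0]
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    resid = y - X @ b
    s2 = resid @ resid / (T - k)
    d = R @ b - r
    return d @ np.linalg.solve(s2 * (R @ XtX_inv @ R.T), d) / m

rng = np.random.default_rng(0)
T = 50
X = rng.standard_normal((T, 4))
beta = np.array([0.3, 0.7, 0.5, 0.5])     # assumed values satisfying the restrictions in [8.1.28]
y = X @ beta + rng.standard_normal(T)

R = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, -1.0]])
r = np.array([1.0, 0.0])
print("joint F statistic:", wald_F(y, X, R, r))

# Single restriction beta_3 = 0.5: F equals the squared t statistic, as in [8.1.33].
R1, r1 = np.array([[0.0, 0.0, 1.0, 0.0]]), np.array([0.5])
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
s2 = np.sum((y - X @ b) ** 2) / (T - 4)
t_stat = (b[2] - 0.5) / np.sqrt(s2 * XtX_inv[2, 2])
print("F:", wald_F(y, X, R1, r1), " t^2:", t_stat**2)
```

The statistic would then be compared with a critical value from the $F(m, T - k)$ distribution; looking up that value is left outside this sketch.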


A Convenient Alternative Expression for the F Test 


It is often straightforward to estimate the model in [8.1.1] subject to the restrictions in [8.1.27]. For example, to impose a constraint $\beta_1 = \beta_1^0$ on the first element of $\boldsymbol{\beta}$, we could just do an ordinary least squares regression of $y_t - \beta_1^0 x_{1t}$ on $x_{2t}, x_{3t}, \ldots, x_{kt}$. The resulting estimates $b_2^*, b_3^*, \ldots, b_k^*$ minimize $\sum_{t=1}^{T} [(y_t - \beta_1^0 x_{1t}) - b_2^* x_{2t} - b_3^* x_{3t} - \cdots - b_k^* x_{kt}]^2$ with respect to $b_2^*, b_3^*, \ldots, b_k^*$ and thus minimize the residual sum of squares [8.1.2] subject to the constraint that $\beta_1 = \beta_1^0$. Alternatively, to impose the constraint in [8.1.28], we could regress $y_t - x_{2t}$ on $(x_{1t} - x_{2t})$ and $(x_{3t} + x_{4t})$:

$$y_t - x_{2t} = \beta_1(x_{1t} - x_{2t}) + \beta_3(x_{3t} + x_{4t}) + u_t.$$

The OLS estimates $b_1^*$ and $b_3^*$ minimize

$$\begin{aligned}
&\sum_{t=1}^{T} [(y_t - x_{2t}) - b_1^*(x_{1t} - x_{2t}) - b_3^*(x_{3t} + x_{4t})]^2 \\
&\quad = \sum_{t=1}^{T} [y_t - b_1^* x_{1t} - (1 - b_1^*)x_{2t} - b_3^* x_{3t} - b_3^* x_{4t}]^2 \qquad [8.1.34]
\end{aligned}$$

and thus minimize [8.1.2] subject to [8.1.28].
Whenever the constraints in [8.1.27] can be imposed through a simple OLS 
regression on transformed variables, there is an easy way to calculate the F statistic 




[8.1.32] just by comparing the residual sum of squares for the constrained and 
unconstrained regressions. The following result is established in Appendix 8.A at 
the end of this chapter. 


Proposition 8.2: Let $\mathbf{b}$ denote the unconstrained OLS estimate [8.1.6] and let $\text{RSS}_1$ be the residual sum of squares resulting from using this estimate:

$$\text{RSS}_1 = \sum_{t=1}^{T} (y_t - \mathbf{x}_t'\mathbf{b})^2. \qquad [8.1.35]$$

Let $\mathbf{b}^*$ denote the constrained OLS estimate and $\text{RSS}_0$ the residual sum of squares from the constrained OLS estimation:

$$\text{RSS}_0 = \sum_{t=1}^{T} (y_t - \mathbf{x}_t'\mathbf{b}^*)^2. \qquad [8.1.36]$$

Then the Wald form of the OLS F test of a linear hypothesis [8.1.32] can equivalently be calculated as

$$F = \frac{(\text{RSS}_0 - \text{RSS}_1)/m}{\text{RSS}_1/(T - k)}. \qquad [8.1.37]$$

Expressions [8.1.37] and [8.1.32] will generate exactly the same number, regardless of whether the null hypothesis and the model are valid or not.

For example, suppose the sample size is $T = 50$ observations and the null hypothesis is $\beta_3 = \beta_4 = 0$ in an OLS regression with $k = 4$ explanatory variables. First regress $y_t$ on $x_{1t}, x_{2t}, x_{3t}, x_{4t}$ and call the residual sum of squares from this regression $\text{RSS}_1$. Next, regress $y_t$ on just $x_{1t}$ and $x_{2t}$ and call the residual sum of squares from this restricted regression $\text{RSS}_0$. If

$$\frac{(\text{RSS}_0 - \text{RSS}_1)/2}{\text{RSS}_1/(50 - 4)}$$

is greater than 3.20 (the 5% critical value for an $F(2, 46)$ random variable), then the null hypothesis should be rejected.
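The numerical equivalence asserted in Proposition 8.2 can be confirmed on simulated data. The sketch below imposes the restrictions in [8.1.28] through the transformed regression of [8.1.34] and compares [8.1.37] with the Wald form [8.1.32]. The coefficient values are assumptions chosen so that the null hypothesis is false, since the equality is claimed to hold regardless; the helper name `ols_rss` is hypothetical.

```python
import numpy as np

def ols_rss(y, X):
    """Residual sum of squares from an OLS regression of y on X."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    return e @ e

rng = np.random.default_rng(0)
T, k, m = 50, 4, 2
X = rng.standard_normal((T, k))
beta = np.array([0.3, 0.2, 0.5, -0.1])      # violates both restrictions in [8.1.28]
y = X @ beta + rng.standard_normal(T)

# Unconstrained regression: RSS_1 as in [8.1.35].
rss1 = ols_rss(y, X)

# Constrained regression via transformed variables, as in [8.1.34]: RSS_0.
Z = np.column_stack([X[:, 0] - X[:, 1], X[:, 2] + X[:, 3]])
rss0 = ols_rss(y - X[:, 1], Z)
F_rss = ((rss0 - rss1) / m) / (rss1 / (T - k))            # [8.1.37]

# Wald form [8.1.32] with R and r from [8.1.28].
R = np.array([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]])
r = np.array([1.0, 0.0])
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
s2 = rss1 / (T - k)
d = R @ b - r
F_wald = d @ np.linalg.solve(s2 * (R @ XtX_inv @ R.T), d) / m

print("F from RSS comparison:", F_rss)
print("F from Wald form:     ", F_wald)
```

The two printed values should agree up to floating-point rounding.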


8.2. Ordinary Least Squares Under More General 
Conditions 


The previous section analyzed the regression model

$$y_t = \mathbf{x}_t'\boldsymbol{\beta} + u_t$$

under the maintained Assumption 8.1 ($\mathbf{x}_t$ is deterministic and $u_t$ is i.i.d. Gaussian). We will hereafter refer to this assumption as "case 1." This section generalizes this assumption to describe specifications likely to arise in time series analysis. Some of the key results are summarized in Table 8.1.


Case 2. Error Term i.i.d. Gaussian and Independent
of Explanatory Variables


Consider the case in which $\mathbf{X}$ is stochastic but completely independent of $\mathbf{u}$.


Assumption 8.2:⁵ (a) $\mathbf{x}_t$ stochastic and independent of $u_s$ for all $t$, $s$; (b) $u_t \sim$ i.i.d. $N(0, \sigma^2)$.


⁵This could be replaced with the assumption $\mathbf{u} \mid \mathbf{X} \sim N(\mathbf{0}, \sigma^2\mathbf{I}_T)$ with all the results to follow unchanged.




Many of the results for deterministic regressors continue to apply for this 
case. For example, taking expectations of [8.1.12] and exploiting the independence 
assumption, 


$$E(\mathbf{b}) = \boldsymbol{\beta} + \{E[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\}\{E(\mathbf{u})\} = \boldsymbol{\beta}, \qquad [8.2.1]$$


so that the OLS coefficient remains unbiased. 

The distribution of test statistics for this case can be found by a two-step 
procedure. The first step evaluates the distribution conditional on X; that is, it 
treats X as deterministic just as in the earlier analysis. The second step multiplies 
by the density of X and integrates over X to find the true unconditional distribution. 
For example, [8.1.17] implies that 


$$b|X \sim N(\beta, \sigma^2(X'X)^{-1}).$$  [8.2.2]


If this density is multiplied by the density of X and integrated over X, the result 
is no longer a Gaussian distribution; thus, b is non-Gaussian under Assumption 
8.2. On the other hand, [8.1.24] implies that 


$$RSS|X \sim \sigma^2 \cdot \chi^2(T - k).$$


But this density is the same for all X. Thus, when we multiply the density of RSS|X 
by the density of X and integrate, we will get exactly the same density. Hence, 
[8.1.24] continues to give the correct unconditional distribution for Assumption 
8.2. 

The same is true for the t and F statistics in [8.1.26] and [8.1.32]. Conditional
on X, $(b_i - \beta_i^0)/[\sigma(\xi^{ii})^{1/2}] \sim N(0, 1)$ and $s/\sigma$ is the square root of an independent
$[1/(T - k)] \cdot \chi^2(T - k)$ variable. Hence, conditional on X, the statistic in [8.1.26]
has a t(T − k) distribution. Since this is true for any X, when we multiply by the
density of X and integrate over X we obtain the same distribution.
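The two-step argument can be written out in one line. In the sketch below, $f_{t(T-k)}$ denotes the density of a t(T − k) variable and $f_X$ the density of X; both symbols are notation introduced here for illustration rather than notation used elsewhere in the chapter.

```latex
\begin{aligned}
f(t_T) &= \int f(t_T \mid X)\, f_X(X)\, dX
        = \int f_{t(T-k)}(t_T)\, f_X(X)\, dX \\
       &= f_{t(T-k)}(t_T) \int f_X(X)\, dX
        = f_{t(T-k)}(t_T).
\end{aligned}
```

The conditional density can be pulled outside the integral precisely because it does not depend on X, which is the same reason [8.1.24] survives unconditionally.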


Case 3. Error Term i.i.d. Non-Gaussian and Independent 
of Explanatory Variables 


Next consider the following specification. 


Assumption 8.3: (a) $x_t$ stochastic and independent of $u_s$ for all t, s; (b) $u_t$ non-Gaussian
but i.i.d. with mean zero, variance $\sigma^2$, and $E(u_t^4) = \mu_4 < \infty$; (c) $E(x_t x_t') = Q_t$,
a positive definite matrix with $(1/T)\sum_{t=1}^T Q_t \to Q$, a positive definite matrix;
(d) $E(x_{it}x_{jt}x_{lt}x_{mt}) < \infty$ for all i, j, l, m, and t; (e) $(1/T)\sum_{t=1}^T (x_t x_t') \xrightarrow{p} Q$.


Since result [8.2.1] required only the independence assumption, b continues
to be unbiased in this case. However, for hypothesis tests, the small-sample
distributions of $s^2$ and the t and F statistics are no longer the same as when the
population residuals are Gaussian. To justify the usual OLS inference rules, we
have to appeal to asymptotic results, for which purpose Assumption 8.3 includes
conditions (c) through (e). To understand these conditions, note that if $x_t$ is
covariance-stationary, then $E(x_t x_t')$ does not depend on t. Then $Q_t = Q$ for all t and
condition (e) simply requires that $x_t$ be ergodic for second moments. Assumption
8.3 also allows more general processes in that $E(x_t x_t')$ might be different for
different t, so long as the limit of $(1/T)\sum_{t=1}^T E(x_t x_t')$ can be consistently estimated by
$(1/T)\sum_{t=1}^T (x_t x_t')$.




TABLE 8.1
Properties of OLS Estimates and Test Statistics Under Various Assumptions

          Coefficient b                                          Variance s^2                                                        t statistic                    F statistic

Case 1    unbiased                                               unbiased                                                            exact t(T − k)                 exact F(m, T − k)
          $b \sim N(\beta, \sigma^2(X'X)^{-1})$                  $(T - k)s^2/\sigma^2 \sim \chi^2(T - k)$

Case 2    unbiased                                               unbiased                                                            exact t(T − k)                 exact F(m, T − k)
          non-Gaussian                                           $(T - k)s^2/\sigma^2 \sim \chi^2(T - k)$

Case 3    unbiased                                               unbiased                                                            $t_T \xrightarrow{L} N(0, 1)$  $mF_T \xrightarrow{L} \chi^2(m)$
          $\sqrt{T}(b_T - \beta) \xrightarrow{L} N(0, \sigma^2 Q^{-1})$   $\sqrt{T}(s_T^2 - \sigma^2) \xrightarrow{L} N(0, \mu_4 - \sigma^4)$

Case 4    biased                                                 biased                                                              $t_T \xrightarrow{L} N(0, 1)$  $mF_T \xrightarrow{L} \chi^2(m)$
          $\sqrt{T}(b_T - \beta) \xrightarrow{L} N(0, \sigma^2 Q^{-1})$   $\sqrt{T}(s_T^2 - \sigma^2) \xrightarrow{L} N(0, \mu_4 - \sigma^4)$

Regression model is $y = X\beta + u$, b is given by [8.1.6], $s^2$ by [8.1.18], t statistic by [8.1.26], and F statistic by [8.1.32]; $\mu_4$ denotes $E(u_t^4)$.

Case 1: X nonstochastic, $u \sim N(0, \sigma^2 I_T)$.
Case 2: X stochastic, $u \sim N(0, \sigma^2 I_T)$, X independent of u.
Case 3: X stochastic, u ~ non-Gaussian $(0, \sigma^2 I_T)$, X independent of u, $T^{-1}\sum x_t x_t' \xrightarrow{p} Q$.
Case 4: Stationary autoregression with independent errors, Q given by [8.2.27].


To describe the asymptotic results, we denote the OLS estimator [8.1.3] by
$b_T$ to emphasize that it is based on a sample of size T. Our interest is in the behavior
of $b_T$ as T becomes large. We first establish that the OLS coefficient estimator is
consistent under Assumption 8.3, that is, that $b_T \xrightarrow{p} \beta$.

Note that [8.1.12] implies

$$b_T - \beta = \left[\sum_{t=1}^T x_t x_t'\right]^{-1}\left[\sum_{t=1}^T x_t u_t\right] = \left[(1/T)\sum_{t=1}^T x_t x_t'\right]^{-1}\left[(1/T)\sum_{t=1}^T x_t u_t\right].$$  [8.2.3]

Consider the first term in [8.2.3]. Assumption 8.3(e) and Proposition 7.1 imply
that

$$\left[(1/T)\sum_{t=1}^T x_t x_t'\right]^{-1} \xrightarrow{p} Q^{-1}.$$  [8.2.4]

Considering next the second term in [8.2.3], notice that $x_t u_t$ is a martingale
difference sequence with variance-covariance matrix given by

$$E(x_t u_t x_t' u_t) = \{E(x_t x_t')\} \cdot \sigma^2,$$

which is finite. Thus, from Example 7.11,

$$\left[(1/T)\sum_{t=1}^T x_t u_t\right] \xrightarrow{p} 0.$$  [8.2.5]

Applying Example 7.2 to [8.2.3] through [8.2.5],

$$b_T - \beta \xrightarrow{p} Q^{-1} \cdot 0 = 0,$$

verifying that the OLS estimator is consistent.
Next turn to the asymptotic distribution of b. Notice from [8.2.3] that

$$\sqrt{T}(b_T - \beta) = \left[(1/T)\sum_{t=1}^T x_t x_t'\right]^{-1}\left[(1/\sqrt{T})\sum_{t=1}^T x_t u_t\right].$$  [8.2.6]

We saw in [8.2.4] that the first term converges in probability to $Q^{-1}$. The second
term is $\sqrt{T}$ times the sample mean of $x_t u_t$, where $x_t u_t$ is a martingale difference
sequence with variance $\sigma^2 \cdot E(x_t x_t') = \sigma^2 Q_t$ and $(1/T)\sum_{t=1}^T \sigma^2 Q_t \to \sigma^2 Q$. Notice that
under Assumption 8.3 we can apply Proposition 7.9:

$$\left[(1/\sqrt{T})\sum_{t=1}^T x_t u_t\right] \xrightarrow{L} N(0, \sigma^2 Q).$$  [8.2.7]

Combining [8.2.6], [8.2.4], and [8.2.7], we see as in Example 7.5 that

$$\sqrt{T}(b_T - \beta) \xrightarrow{L} N(0, [Q^{-1} \cdot (\sigma^2 Q) \cdot Q^{-1}]) = N(0, \sigma^2 Q^{-1}).$$  [8.2.8]

In other words, we can act as if

$$b_T \approx N(\beta, \sigma^2 Q^{-1}/T),$$  [8.2.9]

where the symbol $\approx$ means "is approximately distributed." Recalling Assumption
8.3(e), in large samples Q should be close to $(1/T)\sum_{t=1}^T x_t x_t'$. Thus $Q^{-1}/T$ should
be close to $[\sum_{t=1}^T x_t x_t']^{-1} = (X_T'X_T)^{-1}$ for $X_T$ the same (T × k) matrix that was
represented in [8.1.5] simply by X (again, the subscript T is added at this point to
emphasize that the dimensions of this matrix depend on T). Thus, [8.2.9] can be
approximated by

$$b_T \approx N(\beta, \sigma^2(X_T'X_T)^{-1}).$$

This, of course, is the same result obtained in [8.1.17], which assumed Gaussian
disturbances. With non-Gaussian disturbances the distribution is not exact, but
provides an increasingly good approximation as the sample size grows.
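A small simulation can make the approximation [8.2.8] concrete. The sketch below is an illustration under assumed settings that are not taken from the text: the regressors are a constant and a standard Normal draw (so that Q is the identity matrix), the errors are centered exponential variables with variance 1, and T = 200 with 2,000 replications.

```python
import numpy as np

rng = np.random.default_rng(1)
T, reps = 200, 2000
beta = np.array([1.0, -0.5])

# Regressors: a constant and a standard Normal draw, so Q = E(x_t x_t') = I_2.
X = np.ones((reps, T, 2))
X[:, :, 1] = rng.normal(size=(reps, T))

# Non-Gaussian errors: centered exponential, mean 0 and variance sigma^2 = 1.
u = rng.exponential(scale=1.0, size=(reps, T)) - 1.0
y = X @ beta + u

# OLS in every replication: b = (X'X)^{-1} X'y.
XtX = np.einsum("rti,rtj->rij", X, X)
Xty = np.einsum("rti,rt->ri", X, y)
b = np.linalg.solve(XtX, Xty[..., None])[..., 0]

z = np.sqrt(T) * (b - beta)        # sqrt(T)(b_T - beta), one row per replication
cov_z = np.cov(z, rowvar=False)    # should be near sigma^2 Q^{-1} = I_2
```

With these settings the sample covariance matrix `cov_z` should be close to the identity matrix even though the errors are strongly skewed.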


Next, consider consistency of the variance estimate $s_T^2$. Notice that the
population residual sum of squares can be written

$$\begin{aligned}
(y_T - X_T\beta)'(y_T - X_T\beta)
 &= (y_T - X_T b_T + X_T b_T - X_T\beta)'(y_T - X_T b_T + X_T b_T - X_T\beta) \\
 &= (y_T - X_T b_T)'(y_T - X_T b_T) + (X_T b_T - X_T\beta)'(X_T b_T - X_T\beta),
\end{aligned}$$  [8.2.10]

where cross-product terms have vanished, since

$$(y_T - X_T b_T)'X_T(b_T - \beta) = 0,$$

by the OLS orthogonality condition [8.1.10]. Dividing [8.2.10] by T,

$$(1/T)(y_T - X_T\beta)'(y_T - X_T\beta) = (1/T)(y_T - X_T b_T)'(y_T - X_T b_T) + (1/T)(b_T - \beta)'X_T'X_T(b_T - \beta),$$

or

$$(1/T)(y_T - X_T b_T)'(y_T - X_T b_T) = (1/T)(u_T'u_T) - (b_T - \beta)'(X_T'X_T/T)(b_T - \beta).$$  [8.2.11]

Now, $(1/T)(u_T'u_T) = (1/T)\sum_{t=1}^T u_t^2$, where $\{u_t^2\}$ is an i.i.d. sequence with mean $\sigma^2$.
Thus, by the law of large numbers,

$$(1/T)(u_T'u_T) \xrightarrow{p} \sigma^2.$$

For the second term in [8.2.11], we have $(X_T'X_T/T) \xrightarrow{p} Q$ and $(b_T - \beta) \xrightarrow{p} 0$, and
so, from Proposition 7.1,

$$(b_T - \beta)'(X_T'X_T/T)(b_T - \beta) \xrightarrow{p} 0'Q0 = 0.$$

Substituting these results into [8.2.11],

$$(1/T)(y_T - X_T b_T)'(y_T - X_T b_T) \xrightarrow{p} \sigma^2.$$  [8.2.12]

Now, [8.2.12] describes an estimate of the variance, which we denote $\hat{\sigma}_T^2$:

$$\hat{\sigma}_T^2 = (1/T)(y_T - X_T b_T)'(y_T - X_T b_T).$$  [8.2.13]

The OLS estimator given in [8.1.18],

$$s_T^2 = [1/(T - k)](y_T - X_T b_T)'(y_T - X_T b_T),$$  [8.2.14]

differs from $\hat{\sigma}_T^2$ by a term that vanishes as $T \to \infty$,

$$s_T^2 = a_T \cdot \hat{\sigma}_T^2,$$

where $a_T = [T/(T - k)]$ with $\lim_{T \to \infty} a_T = 1$. Hence, from Proposition 7.1,

$$\text{plim } s_T^2 = 1 \cdot \sigma^2,$$

establishing consistency of $s_T^2$.




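To find the asymptotic distribution of $s_T^2$, consider first $\sqrt{T}(\hat{\sigma}_T^2 - \sigma^2)$. From
[8.2.11], this equals

$$\sqrt{T}(\hat{\sigma}_T^2 - \sigma^2) = (1/\sqrt{T})(u_T'u_T) - \sqrt{T}\sigma^2 - \sqrt{T}(b_T - \beta)'(X_T'X_T/T)(b_T - \beta).$$  [8.2.15]

But

$$(1/\sqrt{T})(u_T'u_T) - \sqrt{T}\sigma^2 = (1/\sqrt{T})\sum_{t=1}^T (u_t^2 - \sigma^2),$$

where $\{u_t^2 - \sigma^2\}$ is a sequence of i.i.d. variables with mean zero and variance
$E(u_t^2 - \sigma^2)^2 = E(u_t^4) - 2\sigma^2 E(u_t^2) + \sigma^4 = \mu_4 - \sigma^4$. Hence, by the central limit
theorem,

$$(1/\sqrt{T})(u_T'u_T) - \sqrt{T}\sigma^2 \xrightarrow{L} N(0, (\mu_4 - \sigma^4)).$$  [8.2.16]

For the last term in [8.2.15], we have $\sqrt{T}(b_T - \beta) \xrightarrow{L} N(0, \sigma^2 Q^{-1})$, $(X_T'X_T/T)
\xrightarrow{p} Q$, and $(b_T - \beta) \xrightarrow{p} 0$. Hence,

$$\sqrt{T}(b_T - \beta)'(X_T'X_T/T)(b_T - \beta) \xrightarrow{p} 0.$$  [8.2.17]

Putting [8.2.16] and [8.2.17] into [8.2.15], we conclude

$$\sqrt{T}(\hat{\sigma}_T^2 - \sigma^2) \xrightarrow{L} N(0, (\mu_4 - \sigma^4)).$$  [8.2.18]

To see that $s_T^2$ has this same limiting distribution, notice that

$$\sqrt{T}(s_T^2 - \sigma^2) - \sqrt{T}(\hat{\sigma}_T^2 - \sigma^2) = \sqrt{T}\{[T/(T - k)]\hat{\sigma}_T^2 - \hat{\sigma}_T^2\} = [(k\sqrt{T})/(T - k)]\hat{\sigma}_T^2.$$

But $\lim_{T \to \infty} [(k\sqrt{T})/(T - k)] = 0$, establishing that

$$\sqrt{T}(s_T^2 - \sigma^2) - \sqrt{T}(\hat{\sigma}_T^2 - \sigma^2) \xrightarrow{p} 0 \cdot \sigma^2 = 0$$

and hence, from Proposition 7.3(a),

$$\sqrt{T}(s_T^2 - \sigma^2) \xrightarrow{L} N(0, (\mu_4 - \sigma^4)).$$  [8.2.19]

Notice that if we are relying on asymptotic justifications for test statistics,
theory offers us no guidance for choosing between $s^2$ and $\hat{\sigma}^2$ as estimates of $\sigma^2$,
since they have the same limiting distribution.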
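The relation between the two variance estimates is easy to check numerically. The following sketch computes both on artificial data; the function name `variance_estimates`, the Student t errors, and the sample sizes are assumptions chosen only for illustration.

```python
import numpy as np

def variance_estimates(y, X):
    """Return (sigma2_hat, s2) as defined in [8.2.13] and [8.2.14]."""
    T, k = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ b) ** 2))
    return rss / T, rss / (T - k)

rng = np.random.default_rng(2)
k = 3
ratios = {}
for T in (20, 200, 2000):
    X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
    y = X @ np.ones(k) + rng.standard_t(df=5, size=T)   # non-Gaussian errors
    sigma2_hat, s2 = variance_estimates(y, X)
    ratios[T] = s2 / sigma2_hat    # equals a_T = T/(T - k), which tends to 1
```

The ratio stored in `ratios` depends only on T and k, which is why the two estimates share the same limiting distribution.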

Next consider the asymptotic distribution of the OLS t test of the null
hypothesis $\beta_i = \beta_i^0$,

$$t_T = \frac{(b_{iT} - \beta_i^0)}{s_T\sqrt{\xi_T^{ii}}} = \frac{\sqrt{T}(b_{iT} - \beta_i^0)}{s_T\sqrt{T\xi_T^{ii}}},$$  [8.2.20]

where $\xi_T^{ii}$ denotes the row i, column i element of $(X_T'X_T)^{-1}$. We have seen that
$\sqrt{T}(b_{iT} - \beta_i^0) \xrightarrow{L} N(0, \sigma^2 q^{ii})$, where $q^{ii}$ denotes the row i, column i element of
$Q^{-1}$. Similarly, $T\xi_T^{ii}$ is the row i, column i element of $(X_T'X_T/T)^{-1}$ and converges
in probability to $q^{ii}$. Also, $s_T \xrightarrow{p} \sigma$. Hence, the t statistic [8.2.20] has a limiting
distribution that is the same as a $N(0, \sigma^2 q^{ii})$ variable divided by $\sqrt{\sigma^2 q^{ii}}$; that is,

$$t_T \xrightarrow{L} N(0, 1).$$  [8.2.21]

Now, under the more restrictive conditions of Assumption 8.2, we saw that
$t_T$ would have a t distribution with (T − k) degrees of freedom. Recall that a t
variable with N degrees of freedom has the distribution of the ratio of a N(0, 1)
variable to the square root of (1/N) times an independent $\chi^2(N)$ variable. But a
$\chi^2(N)$ variable in turn is the sum of N squares of independent N(0, 1) variables.
Thus, letting Z denote a N(0, 1) variable, a t variable with N degrees of freedom
has the same distribution as

$$t_N = \frac{Z}{\{(Z_1^2 + Z_2^2 + \cdots + Z_N^2)/N\}^{1/2}}.$$

By the law of large numbers,

$$(Z_1^2 + Z_2^2 + \cdots + Z_N^2)/N \xrightarrow{p} E(Z_t^2) = 1,$$

and so $t_N \xrightarrow{L} N(0, 1)$. Hence, the critical value for a t variable with N degrees of
freedom will be arbitrarily close to that for a N(0, 1) variable as N becomes large.
Even though the statistic calculated in [8.2.20] does not have an exact t(T − k)
distribution under Assumption 8.3, if we treat it as if it did, then we will not be
far wrong if our sample is sufficiently large.
The same is true of [8.1.32], the F test of m different restrictions:

$$\begin{aligned}
F_T &= (Rb_T - r)'[s_T^2 R(X_T'X_T)^{-1}R']^{-1}(Rb_T - r)/m \\
    &= \sqrt{T}(Rb_T - r)'[s_T^2 R(X_T'X_T/T)^{-1}R']^{-1}\sqrt{T}(Rb_T - r)/m.
\end{aligned}$$  [8.2.22]

Here $s_T^2 \xrightarrow{p} \sigma^2$, $X_T'X_T/T \xrightarrow{p} Q$, and, under the null hypothesis,

$$\sqrt{T}(Rb_T - r) = [R\sqrt{T}(b_T - \beta)] \xrightarrow{L} N(0, \sigma^2 RQ^{-1}R').$$

Hence, under the null hypothesis,

$$m \cdot F_T \xrightarrow{p} [R\sqrt{T}(b_T - \beta)]'[\sigma^2 RQ^{-1}R']^{-1}[R\sqrt{T}(b_T - \beta)].$$

This is a quadratic function of a Normal vector of the type described by Proposition
8.1, from which

$$m \cdot F_T \xrightarrow{L} \chi^2(m).$$

Thus an asymptotic inference can be based on the approximation

$$(Rb_T - r)'[s_T^2 R(X_T'X_T)^{-1}R']^{-1}(Rb_T - r) \approx \chi^2(m).$$  [8.2.23]

This is known as the Wald form of the OLS $\chi^2$ test.

As in the case of the t and limiting Normal distributions, viewing [8.2.23] as
$\chi^2(m)$ and viewing [8.2.22] as F(m, T − k) asymptotically amount to the same
test. Recall that an F(m, N) variable is a ratio of a $\chi^2(m)$ variable to an independent
$\chi^2(N)$ variable, each divided by its degrees of freedom. Thus, if $Z_i$ denotes
a N(0, 1) variable and X a $\chi^2(m)$ variable,

$$F_{m,N} = \frac{X/m}{(Z_1^2 + Z_2^2 + \cdots + Z_N^2)/N}.$$

For the denominator,

$$(Z_1^2 + Z_2^2 + \cdots + Z_N^2)/N \xrightarrow{p} E(Z_t^2) = 1,$$

implying

$$F_{m,N} \xrightarrow[N \to \infty]{L} X/m.$$

Hence, comparing [8.2.23] with a $\chi^2(m)$ critical value or comparing [8.2.22] with
an F(m, T − k) critical value will result in the identical test for sufficiently large
T (see Exercise 8.2).
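The link between the Wald form [8.2.23] and the F statistic can be verified directly, since the Wald statistic is exactly m times the F statistic in any given sample. The sketch below uses artificial data; the function name `wald_statistic`, the seed, and the particular restriction matrix are assumptions made for illustration.

```python
import numpy as np

def wald_statistic(y, X, R, r):
    """(Rb - r)'[s^2 R (X'X)^{-1} R']^{-1}(Rb - r), the Wald form [8.2.23]."""
    T, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    resid = y - X @ b
    s2 = resid @ resid / (T - k)
    d = R @ b - r
    return float(d @ np.linalg.solve(s2 * (R @ XtX_inv @ R.T), d))

def rss(y, X):
    """Residual sum of squares from an OLS regression of y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ b) ** 2))

rng = np.random.default_rng(3)
T, k, m = 50, 4, 2
X = np.column_stack([np.ones(T), rng.normal(size=(T, 3))])
y = X @ np.array([1.0, 0.5, 0.0, 0.0]) + rng.normal(size=T)

# Null hypothesis: the last two coefficients are zero.
R = np.array([[0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
r = np.zeros(m)

W = wald_statistic(y, X, R, r)
F = ((rss(y, X[:, :2]) - rss(y, X)) / m) / (rss(y, X) / (T - k))
```

Here `W` would be compared with a $\chi^2(m)$ critical value and `F` with an F(m, T − k) critical value; the two statistics differ only by the factor m.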


For a given sample of size T, the small-sample distribution (the t or F
distribution) implies wider confidence intervals than the large-sample distribution (the
Normal or $\chi^2$ distribution). Even when the justification for using the t or F
distribution is only asymptotic, many researchers prefer to use the t or F tables rather
than the Normal or $\chi^2$ tables on the grounds that the former are more conservative
and may represent a better approximation to the true small-sample distribution.

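If we are relying only on the asymptotic distribution, the Wald test statistic
[8.2.23] can be generalized to allow a test of a nonlinear set of restrictions on $\beta$.
Consider a null hypothesis consisting of m separate nonlinear restrictions of the
form $g(\beta) = 0$ where $g: \mathbb{R}^k \to \mathbb{R}^m$ and $g(\cdot)$ has continuous first derivatives. Result
[8.2.8] and Proposition 7.4 imply that

$$\sqrt{T}[g(b_T) - g(\beta_0)] \xrightarrow{L} \left[\left.\frac{\partial g}{\partial \beta'}\right|_{\beta = \beta_0}\right] z,$$

where $z \sim N(0, \sigma^2 Q^{-1})$ and

$$\left.\frac{\partial g}{\partial \beta'}\right|_{\beta = \beta_0}$$

denotes the (m × k) matrix of derivatives of $g(\cdot)$ with respect to $\beta$, evaluated at
the true value $\beta_0$. Under the null hypothesis that $g(\beta_0) = 0$, it follows from
Proposition 8.1 that

$$\{\sqrt{T} \cdot g(b_T)\}'\left\{\left[\left.\frac{\partial g}{\partial \beta'}\right|_{\beta = \beta_0}\right]\sigma^2 Q^{-1}\left[\left.\frac{\partial g}{\partial \beta'}\right|_{\beta = \beta_0}\right]'\right\}^{-1}\{\sqrt{T} \cdot g(b_T)\} \xrightarrow{L} \chi^2(m).$$

Recall that Q is the plim of $(1/T)(X_T'X_T)$. Since $\partial g/\partial \beta'$ is continuous and since
$b_T \xrightarrow{p} \beta_0$, it follows from Proposition 7.1 that

$$\left[\left.\frac{\partial g}{\partial \beta'}\right|_{\beta = b_T}\right] \xrightarrow{p} \left[\left.\frac{\partial g}{\partial \beta'}\right|_{\beta = \beta_0}\right].$$

Hence a set of m nonlinear restrictions about $\beta$ of the form $g(\beta) = 0$ can be tested
with the statistic

$$\{g(b_T)\}'\left\{\left[\left.\frac{\partial g}{\partial \beta'}\right|_{\beta = b_T}\right]s_T^2(X_T'X_T)^{-1}\left[\left.\frac{\partial g}{\partial \beta'}\right|_{\beta = b_T}\right]'\right\}^{-1}\{g(b_T)\} \xrightarrow{L} \chi^2(m).$$

Note that the Wald test for linear restrictions [8.2.23] can be obtained as a special
case of this more general formula by setting $g(\beta) = R\beta - r$.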

One disadvantage of the Wald test for nonlinear restrictions is that the answer
one obtains can be different depending on how the restrictions $g(\beta) = 0$ are
parameterized. For example, the hypotheses $\beta_1 = \beta_2$ and $\beta_1/\beta_2 = 1$ are equivalent,
and asymptotically a Wald test based on either parameterization should give the
same answer. However, in a particular finite sample the answers could be quite
different. In effect, the nonlinear Wald test approximates the restriction $g(b_T) =
0$ by the linear restriction

$$g(\beta_0) + \left[\left.\frac{\partial g}{\partial \beta'}\right|_{\beta = \beta_0}\right](b_T - \beta_0) = 0.$$

Some care must be taken to ensure that this linearization is reasonable over the
range of plausible values for $\beta$. See Gregory and Veall (1985), Lafontaine and
White (1986), and Phillips and Park (1988) for further discussion.
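The dependence on parameterization can be seen in a short numerical sketch. The code below evaluates the nonlinear Wald statistic for the two equivalent hypotheses $\beta_1 = \beta_2$ and $\beta_1/\beta_2 = 1$ on the same artificial sample. The function name `nonlinear_wald`, the seed, and the data-generating values are assumptions for illustration, and the Jacobians are supplied analytically.

```python
import numpy as np

def nonlinear_wald(b, cov_b, g, jac):
    """{g(b)}'[G cov_b G']^{-1}{g(b)}, with G the Jacobian of g evaluated at b."""
    gb = np.atleast_1d(g(b))
    G = np.atleast_2d(jac(b))
    return float(gb @ np.linalg.solve(G @ cov_b @ G.T, gb))

# Hypothetical small sample in which beta_1 = beta_2 is true.
rng = np.random.default_rng(4)
T = 40
X = rng.normal(size=(T, 2))
y = X @ np.array([1.0, 1.0]) + rng.normal(size=T)

b = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ b
cov_b = (resid @ resid / (T - 2)) * np.linalg.inv(X.T @ X)   # s^2 (X'X)^{-1}

# Parameterization 1: g(beta) = beta_1 - beta_2.
w_diff = nonlinear_wald(b, cov_b,
                        lambda v: v[0] - v[1],
                        lambda v: np.array([1.0, -1.0]))

# Parameterization 2: g(beta) = beta_1/beta_2 - 1.
w_ratio = nonlinear_wald(b, cov_b,
                         lambda v: v[0] / v[1] - 1.0,
                         lambda v: np.array([1.0 / v[1], -v[0] / v[1] ** 2]))
```

Both statistics would be compared with the same $\chi^2(1)$ critical value, yet `w_diff` and `w_ratio` generally differ in a finite sample because the second linearizes the restriction around the estimated coefficients.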




Case 4. Estimating Parameters for an Autoregression


Consider now estimation of the parameters of a pth-order autoregression by
OLS.


Assumption 8.4: The regression model is

$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t$$  [8.2.24]

with roots of $(1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_p z^p) = 0$ outside the unit circle and
with $\{\varepsilon_t\}$ an i.i.d. sequence with mean zero, variance $\sigma^2$, and finite fourth moment
$\mu_4$.


An autoregression has the form of the standard regression model $y_t =
x_t'\beta + u_t$ with $x_t' = (1, y_{t-1}, y_{t-2}, \ldots, y_{t-p})$ and $u_t = \varepsilon_t$. Note, however, that
an autoregression cannot satisfy condition (a) of Assumption 8.2 or 8.3. Even
though $u_t$ is independent of $x_t$ under Assumption 8.4, it will not be the case that
$u_t$ is independent of $x_{t+1}$. Without this independence, none of the small-sample
results for case 1 applies. Specifically, even if $\varepsilon_t$ is Gaussian, the OLS coefficient
b gives a biased estimate of $\beta$ for an autoregression, and the standard t and F
statistics can only be justified asymptotically.

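However, the asymptotic results for case 4 are the same as for case 3 and are
derived in essentially the same way. To adapt the earlier notation, suppose that
the sample consists of T + p observations on $y_t$, numbered $(y_{-p+1}, y_{-p+2}, \ldots,
y_0, y_1, \ldots, y_T)$; OLS estimation will thus use observations 1 through T. Then, as
in [8.2.6],

$$\sqrt{T}(b_T - \beta) = \left[(1/T)\sum_{t=1}^T x_t x_t'\right]^{-1}\left[(1/\sqrt{T})\sum_{t=1}^T x_t u_t\right].$$  [8.2.25]

The first term in [8.2.25] is

$$\left[(1/T)\sum_{t=1}^T x_t x_t'\right]^{-1} =
\begin{bmatrix}
1 & T^{-1}\Sigma y_{t-1} & T^{-1}\Sigma y_{t-2} & \cdots & T^{-1}\Sigma y_{t-p} \\
T^{-1}\Sigma y_{t-1} & T^{-1}\Sigma y_{t-1}^2 & T^{-1}\Sigma y_{t-1}y_{t-2} & \cdots & T^{-1}\Sigma y_{t-1}y_{t-p} \\
T^{-1}\Sigma y_{t-2} & T^{-1}\Sigma y_{t-2}y_{t-1} & T^{-1}\Sigma y_{t-2}^2 & \cdots & T^{-1}\Sigma y_{t-2}y_{t-p} \\
\vdots & \vdots & \vdots & \cdots & \vdots \\
T^{-1}\Sigma y_{t-p} & T^{-1}\Sigma y_{t-p}y_{t-1} & T^{-1}\Sigma y_{t-p}y_{t-2} & \cdots & T^{-1}\Sigma y_{t-p}^2
\end{bmatrix}^{-1},$$

where $\Sigma$ denotes summation over t = 1 to T. The elements in the first row or
column are of the form $T^{-1}\Sigma y_{t-j}$ and converge in probability to $\mu = E(y_t)$, by
Proposition 7.5. Other elements are of the form $T^{-1}\Sigma y_{t-i}y_{t-j}$, which, from [7.2.14],
converges in probability to

$$E(y_{t-i}y_{t-j}) = \gamma_{|i-j|} + \mu^2.$$

Hence

$$\left[(1/T)\sum_{t=1}^T x_t x_t'\right]^{-1} \xrightarrow{p} Q^{-1},$$  [8.2.26]

where

$$Q = \begin{bmatrix}
1 & \mu & \mu & \cdots & \mu \\
\mu & \gamma_0 + \mu^2 & \gamma_1 + \mu^2 & \cdots & \gamma_{p-1} + \mu^2 \\
\mu & \gamma_1 + \mu^2 & \gamma_0 + \mu^2 & \cdots & \gamma_{p-2} + \mu^2 \\
\vdots & \vdots & \vdots & \cdots & \vdots \\
\mu & \gamma_{p-1} + \mu^2 & \gamma_{p-2} + \mu^2 & \cdots & \gamma_0 + \mu^2
\end{bmatrix}.$$  [8.2.27]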

For the second term in [8.2.25], observe that $x_t u_t$ is a martingale difference
sequence with positive definite variance-covariance matrix given by

$$E(x_t u_t u_t x_t') = E(u_t^2) \cdot E(x_t x_t') = \sigma^2 Q.$$

Using an argument similar to that in Example 7.15, it can be shown that

$$\left[(1/\sqrt{T})\sum_{t=1}^T x_t u_t\right] \xrightarrow{L} N(0, \sigma^2 Q)$$  [8.2.28]

(see Exercise 8.3). Substituting [8.2.26] and [8.2.28] into [8.2.25],

$$\sqrt{T}(b_T - \beta) \xrightarrow{L} N(0, \sigma^2 Q^{-1}).$$  [8.2.29]

It is straightforward to verify further that $b_T$ and $s_T^2$ are consistent for this
case. From [8.2.26], the asymptotic variance-covariance matrix of $\sqrt{T}(b_T - \beta)$
can be estimated consistently by $s_T^2(X_T'X_T/T)^{-1}$, meaning that standard t and F
statistics that treat $b_T$ as if it were $N(\beta, s_T^2(X_T'X_T)^{-1})$ will yield asymptotically valid
tests of hypotheses about the coefficients of an autoregression.

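As a special case of [8.2.29], consider OLS estimation of a first-order
autoregression,

$$y_t = \phi y_{t-1} + \varepsilon_t,$$

with $|\phi| < 1$. Then Q is the scalar $E(y_{t-1}^2) = \gamma_0$, the variance of an AR(1) process.
We saw in Chapter 3 that this is given by $\sigma^2/(1 - \phi^2)$. Hence, for $\hat{\phi}_T$ the OLS
coefficient,

$$\hat{\phi}_T = \frac{\sum_{t=1}^T y_{t-1}y_t}{\sum_{t=1}^T y_{t-1}^2},$$

result [8.2.29] implies that

$$\sqrt{T}(\hat{\phi}_T - \phi) \xrightarrow{L} N(0, \sigma^2 \cdot [\sigma^2/(1 - \phi^2)]^{-1}) = N(0, 1 - \phi^2).$$  [8.2.30]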
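Result [8.2.30] can be checked by simulation. The sketch below generates many AR(1) samples with no constant term and compares the variance of $\sqrt{T}(\hat{\phi}_T - \phi)$ with $1 - \phi^2$. The value $\phi = 0.5$, the sample size, the burn-in length, and the seed are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
phi, T, reps, burn = 0.5, 500, 2000, 100

# Simulate y_t = phi*y_{t-1} + eps_t for all replications at once.
eps = rng.normal(size=(reps, T + burn))
y = np.zeros((reps, T + burn))
for t in range(1, T + burn):
    y[:, t] = phi * y[:, t - 1] + eps[:, t]
y = y[:, burn:]                  # drop the burn-in so the start is near stationary

# OLS coefficient from regressing y_t on y_{t-1} (no constant term).
y_lag, y_cur = y[:, :-1], y[:, 1:]
phi_hat = np.sum(y_lag * y_cur, axis=1) / np.sum(y_lag ** 2, axis=1)

z = np.sqrt(T - 1) * (phi_hat - phi)   # T - 1 usable observations per sample
var_z = z.var()                        # should be near 1 - phi^2
asymptotic_var = 1.0 - phi ** 2
```

Note that the limiting variance does not depend on $\sigma^2$: a larger innovation variance raises the variance of the regressor and of the error in the same proportion.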


If more precise results than the asymptotic approximation in equation [8.2.29]
are desired, the exact small-sample distribution of $\hat{\phi}_T$ can be calculated in either
of two ways. If the errors in the autoregression [8.2.24] are $N(0, \sigma^2)$, then for any
specified numerical value for $\phi_1, \phi_2, \ldots, \phi_p$ and c the exact small-sample
distribution can be calculated using numerical routines developed by Imhof (1961); for
illustrations of this method, see Evans and Savin (1981) and Flavin (1983). An
alternative is to approximate the small-sample distribution by Monte Carlo
methods. Here the idea is to use a computer to generate pseudo-random variables $\varepsilon_1,
\ldots, \varepsilon_T$, each distributed $N(0, \sigma^2)$ from numerical algorithms such as that described
in Kinderman and Ramage (1976). For fixed starting values $y_{-p+1}, \ldots, y_0$, the
values for $y_1, y_2, \ldots, y_T$ can then be calculated by iterating on [8.2.24].⁶ One
then estimates the parameters of [8.2.24] with an OLS regression on this artificial
sample. A new sample is generated for which a new OLS regression is estimated.
By performing, say, 10,000 such regressions, an estimate of the exact small-sample
distribution of the OLS estimates can be obtained.

For the case of a first-order autoregression, it is known from such calculations
that $\hat{\phi}_T$ is downward-biased in small samples, with the bias becoming more severe
as $\phi$ approaches unity. For example, for a sample of size T = 25 generated by
[8.2.24] with p = 1, c = 0, and $\phi = 1$, the estimate $\hat{\phi}_T$ based on OLS estimation
of [8.2.24] (with a constant term included) will be less than the true value of 1 in
95% of the samples, and will even fall below 0.6 in 10% of the samples.⁷
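The Monte Carlo procedure just described can be sketched as follows for the first-order case with T = 25, c = 0, and $\phi = 1$. The fixed starting value $y_0 = 0$, the unit innovation variance, the seed, and NumPy's default generator (rather than the algorithm of Kinderman and Ramage) are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
T, reps, phi, c = 25, 10_000, 1.0, 0.0

# Generate reps artificial samples by iterating on the autoregression,
# starting every sample from the fixed value y_0 = 0.
eps = rng.normal(size=(reps, T))
y = np.zeros((reps, T + 1))
for t in range(1, T + 1):
    y[:, t] = c + phi * y[:, t - 1] + eps[:, t - 1]

# OLS slope from regressing y_t on a constant and y_{t-1} in each sample.
y_lag, y_cur = y[:, :-1], y[:, 1:]
dl = y_lag - y_lag.mean(axis=1, keepdims=True)
dc = y_cur - y_cur.mean(axis=1, keepdims=True)
phi_hat = np.sum(dl * dc, axis=1) / np.sum(dl ** 2, axis=1)

share_below_one = np.mean(phi_hat < 1.0)   # text reports roughly 95%
share_below_06 = np.mean(phi_hat < 0.6)    # text reports roughly 10%
mean_phi_hat = phi_hat.mean()              # below 1 if the estimate is downward-biased
```

The simulated shares will vary slightly with the seed and number of replications, so they should be read as approximations to the values cited in the text.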


Case 5. Errors Gaussian with Known Variance-Covariance 
Matrix 


Next consider the following case. 


Assumption 8.5: (a) $x_t$ stochastic; (b) conditional on the full matrix X, the vector
u is $N(0, \sigma^2 V)$; (c) V is a known positive definite matrix.


When the errors for different dates have different variances but are
uncorrelated with each other (that is, V is diagonal), then the errors are said to exhibit
heteroskedasticity. For V nondiagonal, the errors are said to be autocorrelated.
Writing the variance-covariance matrix as the product of some scalar $\sigma^2$ and a
matrix V is a convention that will help simplify the algebra and interpretation for
some examples of heteroskedasticity and autocorrelation. Note again that
Assumption 8.5(b) could not hold for an autoregression, since conditional on $x_{t+1} =
(1, y_t, y_{t-1}, \ldots, y_{t-p+1})'$ and $x_t$, the value of $u_t$ is known with certainty.

Recall from [8.1.12] that

$$(b - \beta) = (X'X)^{-1}X'u.$$

Taking expectations conditional on X,

$$E[(b - \beta)|X] = (X'X)^{-1}X' \cdot E(u|X) = 0,$$

and by the law of iterated expectations,

$$E(b - \beta) = E_X\{E[(b - \beta)|X]\} = 0.$$

Hence, the OLS coefficient estimate is unbiased.
The variance of b conditional on X is

$$\begin{aligned}
E\{[(b - \beta)(b - \beta)']|X\} &= E\{[(X'X)^{-1}X'uu'X(X'X)^{-1}]|X\} \\
 &= \sigma^2(X'X)^{-1}X'VX(X'X)^{-1}.
\end{aligned}$$  [8.2.31]

Thus, conditional on X,

$$b|X \sim N(\beta, \sigma^2(X'X)^{-1}X'VX(X'X)^{-1}).$$


⁶Alternatively, one can generate the initial values for y with a draw from the appropriate
unconditional distribution. Specifically, generate a (p × 1) vector $v \sim N(0, I_p)$ and set $(y_{-p+1}, \ldots, y_0)'
= \mu \cdot 1 + P \cdot v$, where $\mu = c/(1 - \phi_1 - \phi_2 - \cdots - \phi_p)$, 1 denotes a (p × 1) vector of 1s, and P is
the Cholesky factor such that $P \cdot P' = \Gamma$ for $\Gamma$ the (p × p) matrix whose columns stacked in a ($p^2$ × 1)
vector comprise the first column of the matrix $\sigma^2[I_{p^2} - (F \otimes F)]^{-1}$, where F is the (p × p) matrix
defined in equation [1.2.3] in Chapter 1.

⁷These values can be inferred from Table B.5.


Unless $V = I_T$, this is not the same variance matrix as in [8.1.17], so that the OLS
t statistic [8.1.26] does not have the interpretation as a Gaussian variable divided
by an estimate of its standard deviation. Thus [8.1.26] will not have a t(T − k)
distribution in small samples, nor will it even asymptotically be N(0, 1). A valid
test of the hypothesis that $\beta_i = \beta_i^0$ for case 5 would be based not on [8.1.26] but
rather on

$$t^* = \frac{(b_i - \beta_i^0)}{s\sqrt{d_{ii}}},$$  [8.2.32]

where $d_{ii}$ denotes the row i, column i element of $(X'X)^{-1}X'VX(X'X)^{-1}$. This
statistic will be asymptotically N(0, 1).

Although one could form an inference based on [8.2.32], in this case in which
V is known, a superior estimator and test procedure are described in Section 8.3.
First, however, we consider a more general case in which V is of unknown form.
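The matrix whose diagonal supplies $d_{ii}$ is straightforward to compute when V is known. The following sketch is an illustration with assumed inputs: the function name `ols_cov_known_v`, the seed, and the diagonal V used in the example are not from the text.

```python
import numpy as np

def ols_cov_known_v(X, V, sigma2=1.0):
    """sigma^2 (X'X)^{-1} X'VX (X'X)^{-1}, the conditional variance of b in [8.2.31]."""
    XtX_inv = np.linalg.inv(X.T @ X)
    return sigma2 * (XtX_inv @ X.T @ V @ X @ XtX_inv)

rng = np.random.default_rng(7)
T = 30
X = np.column_stack([np.ones(T), rng.normal(size=T)])

# Hypothetical known V: heteroskedastic errors whose variance grows over the sample.
V = np.diag(np.linspace(0.5, 2.0, T))

D = ols_cov_known_v(X, V)       # d_ii in [8.2.32] is D[i, i]
d = np.diag(D)
naive = np.diag(np.linalg.inv(X.T @ X))   # what [8.1.26] would use instead
```

When V is the identity matrix the sandwich collapses to $(X'X)^{-1}$, so `d` and `naive` coincide; otherwise they generally differ, which is why [8.1.26] is not valid in case 5.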


Case 6. Errors Serially Uncorrelated but with General 
Heteroskedasticity 


It may be possible to design asymptotically valid tests even in the presence 
of heteroskedasticity of a completely unknown form. This point was first observed 
by Eicker (1967) and White (1980) and extended to time series regressions by 
Hansen (1982) and Nicholls and Pagan (1983). 


Assumption 8.6: (a) $x_t$ stochastic, including perhaps lagged values of y; (b) $x_t u_t$ is
a martingale difference sequence; (c) $E(u_t^2 x_t x_t') = \Omega_t$, a positive definite matrix, with
$(1/T)\sum_{t=1}^T \Omega_t$ converging to the positive definite matrix $\Omega$ and $(1/T)\sum_{t=1}^T u_t^2 x_t x_t' \xrightarrow{p} \Omega$;
(d) $E(u_t^4 x_{it}x_{jt}x_{lt}x_{mt}) < \infty$ for all i, j, l, m, and t; (e) plims of $(1/T)\sum_{t=1}^T u_t x_{it}x_t x_t'$ and
$(1/T)\sum_{t=1}^T x_{it}x_{jt}x_t x_t'$ exist and are finite for all i and j and $(1/T)\sum_{t=1}^T x_t x_t' \xrightarrow{p} Q$, a
nonsingular matrix.


Assumption 8.6(b) requires $u_t$ to be uncorrelated with its own lagged values
and with current and lagged values of x. Although the errors are presumed to be
serially uncorrelated, Assumption 8.6(c) allows a broad class of conditional
heteroskedasticity for the errors. As an example of such heteroskedasticity, consider
a regression with a single i.i.d. explanatory variable $x_t$ with $E(x_t^2) = \mu_2$ and
$E(x_t^4) = \mu_4$. Suppose that the variance of the residual for date t is given by
$E(u_t^2|x_t) = a + bx_t^2$. Then $E(u_t^2 x_t^2) = E_x[E(u_t^2|x_t) \cdot x_t^2] = E_x[(a + bx_t^2) \cdot x_t^2] = a\mu_2
+ b\mu_4$. Thus, $\Omega_t = a\mu_2 + b\mu_4 = \Omega$ for all t. By the law of large numbers,
$(1/T)\sum_{t=1}^T u_t^2 x_t^2$ will converge to the population moment $\Omega$. Assumption 8.6(c)
allows more general conditional heteroskedasticity in that $E(u_t^2 x_t^2)$ might be a
function of t, provided that the time average of $(u_t^2 x_t^2)$ converges. Assumption 8.6(d)
and (e) impose bounds on higher moments of x and u.

Consistency of b is established using the same arguments as in case 3. The
asymptotic variance is found from writing

$$\sqrt{T}(b_T - \beta) = \left[(1/T)\sum_{t=1}^T x_t x_t'\right]^{-1}\left[(1/\sqrt{T})\sum_{t=1}^T x_t u_t\right].$$

Assumption 8.6(e) ensures that

$$\left[(1/T)\sum_{t=1}^T x_t x_t'\right]^{-1} \xrightarrow{p} Q^{-1}$$

for some nonsingular matrix Q. Similarly, $x_t u_t$ satisfies the conditions of Proposition
7.9, from which

$$\left[(1/\sqrt{T})\sum_{t=1}^T x_t u_t\right] \xrightarrow{L} N(0, \Omega).$$

The asymptotic distribution of the OLS estimate is thus given by

$$\sqrt{T}(b_T - \beta) \xrightarrow{L} N(0, Q^{-1}\Omega Q^{-1}).$$  [8.2.33]


White's proposal was to estimate the asymptotic variance matrix consistently
by substituting $\hat{Q}_T = (1/T)\sum_{t=1}^T x_t x_t'$ and $\hat{\Omega}_T = (1/T)\sum_{t=1}^T \hat{u}_t^2 x_t x_t'$ into [8.2.33], where
$\hat{u}_t$ denotes the OLS residual [8.1.4]. The following result is established in Appendix
8.A to this chapter.


Proposition 8.3: With heteroskedasticity of unknown form satisfying Assumption
8.6, the asymptotic variance-covariance matrix of the OLS coefficient vector can be
consistently estimated by

$$\hat{Q}_T^{-1}\hat{\Omega}_T\hat{Q}_T^{-1} \xrightarrow{p} Q^{-1}\Omega Q^{-1}.$$  [8.2.34]


Recalling [8.2.33], the OLS estimate $b_T$ can be treated as if

$$b_T \approx N(\beta, \hat{V}_T/T)$$

where

$$\hat{V}_T = \hat{Q}_T^{-1}\hat{\Omega}_T\hat{Q}_T^{-1} = (X_T'X_T/T)^{-1}\left[(1/T)\sum_{t=1}^T \hat{u}_t^2 x_t x_t'\right](X_T'X_T/T)^{-1}.$$  [8.2.35]

The square root of the row i, column i element of $\hat{V}_T/T$ is known as a
heteroskedasticity-consistent standard error for the OLS estimate $b_i$. We can, of
course, also use $(\hat{V}_T/T)$ to test a joint hypothesis of the form $R\beta = r$, where R is
an (m × k) matrix summarizing m separate hypotheses about $\beta$. Specifically,

$$(Rb_T - r)'[R(\hat{V}_T/T)R']^{-1}(Rb_T - r)$$  [8.2.36]

has the same asymptotic distribution as

$$[\sqrt{T}(Rb_T - r)]'(RQ^{-1}\Omega Q^{-1}R')^{-1}[\sqrt{T}(Rb_T - r)],$$

which, from [8.2.33], is a quadratic form of an asymptotically Normal (m × 1)
vector $\sqrt{T}(Rb_T - r)$ with weighting matrix the inverse of its variance-covariance
matrix, $(RQ^{-1}\Omega Q^{-1}R')$. Hence, [8.2.36] has an asymptotic $\chi^2$ distribution with m
degrees of freedom.
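A minimal sketch of the heteroskedasticity-consistent calculation follows. It uses the fact that $\hat{V}_T/T$ in [8.2.35] simplifies to $(X_T'X_T)^{-1}[\sum \hat{u}_t^2 x_t x_t'](X_T'X_T)^{-1}$. The function name `white_cov`, the seed, and the simulated form of heteroskedasticity are assumptions for illustration.

```python
import numpy as np

def white_cov(y, X):
    """Return (b, V_T/T) with V_T/T = (X'X)^{-1}[sum u_t^2 x_t x_t'](X'X)^{-1}."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    u_hat = y - X @ b                           # OLS residuals
    meat = (X * u_hat[:, None] ** 2).T @ X      # sum_t u_t^2 x_t x_t'
    return b, XtX_inv @ meat @ XtX_inv

# Hypothetical data in which the error variance depends on the regressor.
rng = np.random.default_rng(8)
T = 200
x = rng.normal(size=T)
X = np.column_stack([np.ones(T), x])
u = np.sqrt(0.5 + x ** 2) * rng.normal(size=T)  # E(u_t^2 | x_t) = 0.5 + x_t^2
y = X @ np.array([1.0, 2.0]) + u

b, cov_white = white_cov(y, X)
robust_se = np.sqrt(np.diag(cov_white))         # heteroskedasticity-consistent s.e.
```

The same `cov_white` matrix would be used in place of $s_T^2(X_T'X_T)^{-1}$ when forming the joint test [8.2.36].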

It is also possible to develop an estimate of the asymptotic variance-covariance
matrix of $b_T$ that is robust with respect to both heteroskedasticity and
autocorrelation:

$$(\hat{V}_T/T) = (X_T'X_T)^{-1}\left[\sum_{t=1}^T \hat{u}_t^2 x_t x_t' + \sum_{v=1}^q \left[1 - \frac{v}{q + 1}\right]\sum_{t=v+1}^T (x_t\hat{u}_t\hat{u}_{t-v}x_{t-v}' + x_{t-v}\hat{u}_{t-v}\hat{u}_t x_t')\right](X_T'X_T)^{-1}.$$

Here q is a parameter representing the number of autocorrelations used to
approximate the dynamics for $u_t$. The square root of the row i, column i element of
$(\hat{V}_T/T)$ is known as the Newey-West (1987) heteroskedasticity- and autocorrelation-
consistent standard error for the OLS estimator. The basis for this expression and
alternative ways to calculate heteroskedasticity- and autocorrelation-consistent
standard errors will be discussed in Chapter 10.
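The Newey-West expression can be coded almost term by term. The sketch below is one possible implementation under stated assumptions: the function name `newey_west_cov`, the seed, the simulated AR(1) errors, and the choice q = 4 are all hypothetical.

```python
import numpy as np

def newey_west_cov(y, X, q):
    """(X'X)^{-1} S (X'X)^{-1}, where S is the bracketed sum in the Newey-West formula."""
    XtX_inv = np.linalg.inv(X.T @ X)
    u_hat = y - X @ (XtX_inv @ X.T @ y)     # OLS residuals
    h = X * u_hat[:, None]                  # row t holds u_t x_t'
    S = h.T @ h                             # sum_t u_t^2 x_t x_t'
    for v in range(1, q + 1):
        gamma_v = h[v:].T @ h[:-v]          # sum_{t=v+1}^T x_t u_t u_{t-v} x_{t-v}'
        S = S + (1.0 - v / (q + 1.0)) * (gamma_v + gamma_v.T)
    return XtX_inv @ S @ XtX_inv

# Hypothetical regression with AR(1) errors.
rng = np.random.default_rng(9)
T = 300
X = np.column_stack([np.ones(T), rng.normal(size=T)])
e = rng.normal(size=T)
u = np.zeros(T)
for t in range(1, T):
    u[t] = 0.6 * u[t - 1] + e[t]
y = X @ np.array([1.0, 2.0]) + u

cov_nw = newey_west_cov(y, X, q=4)
hac_se = np.sqrt(np.diag(cov_nw))
```

Setting q = 0 drops all the autocovariance terms and returns the heteroskedasticity-consistent estimate based on [8.2.35].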


8.3. Generalized Least Squares 


The previous section evaluated OLS estimation under a variety of assumptions,
including $E(uu') \neq \sigma^2 I_T$. Although OLS can be used in this last case, generalized
least squares (GLS) is usually preferred.


GLS with Known Covariance Matrix 


Let us reconsider data generated according to Assumption 8.5, under which
$u|X \sim N(0, \sigma^2 V)$ with V a known (T × T) matrix. Since V is symmetric and positive
definite, there exists a nonsingular (T × T) matrix L such that⁸

$$V^{-1} = L'L.$$  [8.3.1]

⁸We know that there exists a nonsingular matrix P such that $V = PP'$ and so $V^{-1} = [P']^{-1}P^{-1}$.
Take $L = P^{-1}$ to deduce [8.3.1].

Imagine transforming the population residuals u by L:

$$\underset{(T \times 1)}{\tilde{u}} = Lu.$$

This would generate a new set of residuals $\tilde{u}$ with mean 0 and variance conditional
on X given by

$$E(\tilde{u}\tilde{u}'|X) = L \cdot E(uu'|X)L' = L\sigma^2 VL'.$$

But $V = [V^{-1}]^{-1} = [L'L]^{-1}$, meaning

$$E(\tilde{u}\tilde{u}'|X) = \sigma^2 L[L'L]^{-1}L' = \sigma^2 I_T.$$  [8.3.2]

We can thus take the matrix equation that characterizes the basic regression model,

$$y = X\beta + u,$$

and premultiply both sides by L:

$$Ly = LX\beta + Lu,$$

to produce a new regression model

$$\tilde{y} = \tilde{X}\beta + \tilde{u},$$  [8.3.3]

where

$$\tilde{y} = Ly \qquad \tilde{X} = LX \qquad \tilde{u} = Lu$$  [8.3.4]

with $\tilde{u}|X \sim N(0, \sigma^2 I_T)$. Hence, the transformed model [8.3.3] satisfies Assumption
8.2, meaning that all the results for that case apply to [8.3.3]. Specifically, the
estimator

$$\tilde{b} = (\tilde{X}'\tilde{X})^{-1}\tilde{X}'\tilde{y} = (X'L'LX)^{-1}X'L'Ly = (X'V^{-1}X)^{-1}X'V^{-1}y$$  [8.3.5]

is Gaussian with mean $\beta$ and variance $\sigma^2(\tilde{X}'\tilde{X})^{-1} = \sigma^2(X'V^{-1}X)^{-1}$ conditional
on X and is the minimum-variance unbiased estimator conditional on X. The
estimator [8.3.5] is known as the generalized least squares (GLS) estimator. Similarly,

$$\tilde{s}^2 = [1/(T - k)]\sum_{t=1}^T (\tilde{y}_t - \tilde{x}_t'\tilde{b})^2$$  [8.3.6]

has an exact $[\sigma^2/(T - k)] \cdot \chi^2(T - k)$ distribution under Assumption 8.5, while

$$(R\tilde{b} - r)'[\tilde{s}^2 R(X'V^{-1}X)^{-1}R']^{-1}(R\tilde{b} - r)/m$$

has an exact F(m, T − k) distribution under the null hypothesis $R\beta = r$.
We now discuss several examples to make these ideas concrete. 


Heteroskedasticity 


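A simple case to analyze is one for which the variance of $u_t$ is presumed to
be proportional to the square of one of the explanatory variables for that equation,
say, $x_{1t}^2$:

$$E(uu'|X) = \sigma^2\begin{bmatrix}
x_{11}^2 & 0 & \cdots & 0 \\
0 & x_{12}^2 & \cdots & 0 \\
\vdots & \vdots & \cdots & \vdots \\
0 & 0 & \cdots & x_{1T}^2
\end{bmatrix} = \sigma^2 V.$$

Then it is easy to see that

$$L = \begin{bmatrix}
1/|x_{11}| & 0 & \cdots & 0 \\
0 & 1/|x_{12}| & \cdots & 0 \\
\vdots & \vdots & \cdots & \vdots \\
0 & 0 & \cdots & 1/|x_{1T}|
\end{bmatrix}$$

satisfies conditions [8.3.1] and [8.3.2]. Hence, if we regress $y_t/|x_{1t}|$ on $x_t/|x_{1t}|$, all
the standard OLS output from the regression will be valid.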
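The equivalence between GLS and the weighted regression in this example can be checked numerically. The sketch below obtains L from a Cholesky factor of V, as in the footnote to [8.3.1], and compares the result with a regression of $y_t/|x_{1t}|$ on $x_t/|x_{1t}|$. The function name `gls`, the seed, and the simulated data are assumptions for illustration.

```python
import numpy as np

def gls(y, X, V):
    """GLS estimate (X'V^{-1}X)^{-1}X'V^{-1}y computed from the transformed regression."""
    P = np.linalg.cholesky(V)          # V = PP', so L = P^{-1} satisfies V^{-1} = L'L
    y_tilde = np.linalg.solve(P, y)    # L y
    X_tilde = np.linalg.solve(P, X)    # L X
    b, *_ = np.linalg.lstsq(X_tilde, y_tilde, rcond=None)
    return b

# Hypothetical data with Var(u_t | X) proportional to the square of one regressor.
rng = np.random.default_rng(10)
T = 100
x1 = rng.uniform(0.5, 2.0, size=T)
X = np.column_stack([np.ones(T), x1])
y = X @ np.array([1.0, 2.0]) + np.abs(x1) * rng.normal(size=T)

b_gls = gls(y, X, np.diag(x1 ** 2))

# The same estimate from regressing y_t/|x_1t| on x_t/|x_1t|.
w = 1.0 / np.abs(x1)
b_weighted, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)
```

Solving with the Cholesky factor avoids forming $V^{-1}$ explicitly, which is a numerical convenience rather than part of the argument in the text.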


Autocorrelation 
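As a second example, consider

$$u_t = \rho u_{t-1} + \varepsilon_t,$$  [8.3.7]

where $|\rho| < 1$ and $\varepsilon_t$ is Gaussian white noise with variance $\sigma^2$. Then

$$E(uu'|X) = \frac{\sigma^2}{1 - \rho^2}\begin{bmatrix}
1 & \rho & \rho^2 & \cdots & \rho^{T-1} \\
\rho & 1 & \rho & \cdots & \rho^{T-2} \\
\vdots & \vdots & \vdots & \cdots & \vdots \\
\rho^{T-1} & \rho^{T-2} & \rho^{T-3} & \cdots & 1
\end{bmatrix} = \sigma^2 V.$$  [8.3.8]

Notice from expression [5.2.18] that the matrix

$$L = \begin{bmatrix}
\sqrt{1 - \rho^2} & 0 & 0 & \cdots & 0 & 0 \\
-\rho & 1 & 0 & \cdots & 0 & 0 \\
0 & -\rho & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \cdots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & -\rho & 1
\end{bmatrix}$$  [8.3.9]

satisfies [8.3.1]. The GLS estimates are found from an OLS regression of $\tilde{y} = Ly$
on $\tilde{X} = LX$; that is, regress $y_1\sqrt{1 - \rho^2}$ on $x_1\sqrt{1 - \rho^2}$ and $y_t - \rho y_{t-1}$ on $x_t -
\rho x_{t-1}$ for t = 2, 3, ..., T.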
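That the matrix in [8.3.9] satisfies [8.3.1] for the V in [8.3.8] can be confirmed numerically for any particular $\rho$ and T. In the sketch below the function names `ar1_v` and `ar1_l` and the values $\rho = 0.6$, T = 5 are assumptions chosen for illustration.

```python
import numpy as np

def ar1_v(rho, T):
    """V from [8.3.8]: element (i, j) is rho^|i-j| / (1 - rho^2)."""
    idx = np.arange(T)
    return rho ** np.abs(idx[:, None] - idx[None, :]) / (1.0 - rho ** 2)

def ar1_l(rho, T):
    """L from [8.3.9]: ones on the diagonal, -rho below it, sqrt(1 - rho^2) top left."""
    L = np.eye(T) - rho * np.eye(T, k=-1)
    L[0, 0] = np.sqrt(1.0 - rho ** 2)
    return L

rho, T = 0.6, 5
V = ar1_v(rho, T)
L = ar1_l(rho, T)

check_identity = L @ V @ L.T                 # should be the identity matrix, as in [8.3.2]
check_inverse = L.T @ L - np.linalg.inv(V)   # should be a matrix of zeros, as in [8.3.1]
```

Multiplying a data vector by `L` reproduces the transformation described above: the first observation is scaled by $\sqrt{1 - \rho^2}$ and each later observation is quasi-differenced.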

GLS and Maximum Likelihood Estimation 


Assumption 8.5 asserts that $y|X \sim N(X\beta, \sigma^2 V)$. Hence, the log of the
likelihood of y conditioned on X is given by

$$(-T/2)\log(2\pi) - (1/2)\log|\sigma^2 V| - (1/2)(y - X\beta)'(\sigma^2 V)^{-1}(y - X\beta).$$  [8.3.10]

Notice that [8.3.1] can be used to write the last term in [8.3.10] as

$$\begin{aligned}
-(1/2)(y - X\beta)'(\sigma^2 V)^{-1}(y - X\beta)
 &= -[1/(2\sigma^2)](y - X\beta)'(L'L)(y - X\beta) \\
 &= -[1/(2\sigma^2)](Ly - LX\beta)'(Ly - LX\beta) \\
 &= -[1/(2\sigma^2)](\tilde{y} - \tilde{X}\beta)'(\tilde{y} - \tilde{X}\beta).
\end{aligned}$$  [8.3.11]

Similarly, the middle term in [8.3.10] can be written as in [5.2.24]:

$$-(1/2)\log|\sigma^2 V| = -(T/2)\log(\sigma^2) + \log|\det(L)|,$$  [8.3.12]

where $|\det(L)|$ denotes the absolute value of the determinant of L. Substituting
[8.3.11] and [8.3.12] into [8.3.10], the conditional log likelihood can be written as

$$-(T/2)\log(2\pi) - (T/2)\log(\sigma^2) + \log|\det(L)| - [1/(2\sigma^2)](\tilde{y} - \tilde{X}\beta)'(\tilde{y} - \tilde{X}\beta).$$  [8.3.13]

Thus, the log likelihood is maximized with respect to $\beta$ by an OLS regression of
$\tilde{y}$ on $\tilde{X}$,⁹ meaning that the GLS estimate [8.3.5] is also the maximum likelihood
estimate under Assumption 8.5.

The GLS estimate $\tilde{b}$ is still likely to be reasonable even if the residuals u are
non-Gaussian. Specifically, the residuals of the transformed regression [8.3.3] have
mean 0 and variance $\sigma^2 I_T$, and so this regression satisfies the conditions of the
Gauss-Markov theorem: even if the residuals are non-Gaussian, $\tilde{b}$ will have
minimum variance (conditional on X) among the class of all unbiased estimators that
are linear functions of y. Hence, maximization of [8.3.13], or quasi-maximum
likelihood estimation, may offer a useful estimating principle even for non-Gaussian
u.


GLS When the Variance Matrix of Residuals Must 
Be Estimated from the Data 


Up to this point we have been assuming that the elements of V are known a
priori. More commonly, V is posited to be of a particular form $V(\theta)$, where $\theta$ is a
vector of parameters that must be estimated from the data. For example, with first-
order serial correlation of residuals as in [8.3.7], V is the matrix in [8.3.8] and $\theta$
is the scalar $\rho$. As a second example, we might postulate that the variance of
observation t depends on the explanatory variables according to

$$E(u_t^2|x_t) = \sigma^2(1 + \alpha_1 x_{1t}^2 + \alpha_2 x_{2t}^2),$$

in which case $\theta = (\alpha_1, \alpha_2)'$.

⁹This assumes that the parameters of L do not involve $\beta$, as is implied by Assumption 8.5.

Our task is then to estimate $\theta$ and $\beta$ jointly from the data. One approach is
to use as estimates the values of $\theta$ and $\beta$ that maximize [8.3.13]. Since one can
always form [8.3.13] and maximize it numerically, this approach has the appeal of
offering a single rule to follow whenever $E(uu'|X)$ is not of the simple form $\sigma^2 I_T$.
However, other, simpler estimators can also have desirable properties.

It often turns out to be the case that

$$\sqrt{T}(X_T'[V_T(\hat{\theta}_T)]^{-1}X_T)^{-1}(X_T'[V_T(\hat{\theta}_T)]^{-1}y_T)
 \xrightarrow{p} \sqrt{T}(X_T'[V_T(\theta_0)]^{-1}X_T)^{-1}(X_T'[V_T(\theta_0)]^{-1}y_T),$$

where $V_T(\theta_0)$ denotes the true variance of errors and $\hat{\theta}_T$ is any consistent estimate
of $\theta$. Moreover, a consistent estimate of $\theta$ can often be obtained from a simple
analysis of OLS residuals. Thus, an estimate coming from a few simple OLS and
GLS regressions can have the same asymptotic distribution as the maximum
likelihood estimator. Since regressions are much easier to implement than numerical
maximization, the simpler estimates are often used.


Estimation with First-Order Autocorrelation of Regression 
Residuals and No Lagged Endogenous Variables 


We illustrate these issues by considering a regression whose residuals follow the AR(1) process [8.3.7]. For now we maintain the assumption that $u \mid X$ has mean zero and variance $\sigma^2 V(\rho)$, noting that this rules out lagged endogenous variables; that is, we assume that $x_t$ is uncorrelated with $u_{t-s}$. The following subsection comments on the importance of this assumption. Recalling that the determinant of a lower triangular matrix is just the product of the terms on the principal diagonal, we see from [8.3.9] that $\det(L) = \sqrt{1 - \rho^2}$. Thus, the log likelihood [8.3.13] for this case is

$$\begin{aligned}
&-(T/2)\log(2\pi) - (T/2)\log(\sigma^2) + (1/2)\log(1 - \rho^2) \\
&\quad - [(1 - \rho^2)/(2\sigma^2)](y_1 - x_1'\beta)^2 \\
&\quad - [1/(2\sigma^2)] \sum_{t=2}^{T} [(y_t - x_t'\beta) - \rho(y_{t-1} - x_{t-1}'\beta)]^2.
\end{aligned} \tag*{[8.3.14]}$$


One approach, then, is to maximize [8.3.14] numerically with respect to $\beta$, $\rho$, and $\sigma^2$. The reader may recognize [8.3.14] as the exact log likelihood function for an AR(1) process (equation [5.2.9]) with $(y_t - \mu)$ replaced by $(y_t - x_t'\beta)$.
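As an illustration only, the following Python sketch evaluates the exact log likelihood [8.3.14] and compares it with a direct evaluation of the Gaussian log density with covariance $\sigma^2 V(\rho)$. The function names and the use of NumPy are assumptions made for this example, and the form of $V(\rho)$ coded here (the stationary AR(1) correlation pattern scaled by $1/(1-\rho^2)$) is taken to match [8.3.8].

```python
import numpy as np

def ar1_exact_loglik(y, X, beta, rho, sigma2):
    """Exact log likelihood [8.3.14] for y = X beta + u with AR(1) errors."""
    T = len(y)
    u = y - X @ beta                      # u_t = y_t - x_t' beta
    innov = u[1:] - rho * u[:-1]          # u_t - rho * u_{t-1}, t = 2..T
    return (-(T / 2) * np.log(2 * np.pi)
            - (T / 2) * np.log(sigma2)
            + 0.5 * np.log(1 - rho**2)
            - (1 - rho**2) / (2 * sigma2) * u[0]**2
            - np.sum(innov**2) / (2 * sigma2))

def gaussian_loglik_direct(y, X, beta, rho, sigma2):
    """Same quantity from the N(0, sigma2 * V(rho)) density (assumed form of V)."""
    T = len(y)
    idx = np.arange(T)
    V = rho ** np.abs(idx[:, None] - idx[None, :]) / (1 - rho**2)
    u = y - X @ beta
    cov = sigma2 * V
    _, logdet = np.linalg.slogdet(cov)
    quad = u @ np.linalg.solve(cov, u)
    return -(T / 2) * np.log(2 * np.pi) - 0.5 * logdet - 0.5 * quad
```

The two functions should agree up to rounding error for any $|\rho| < 1$; the first form avoids building and inverting a $(T \times T)$ matrix, which is why it would normally be the one passed to a numerical optimizer.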

Just as in the AR(1) case, simpler estimates (with the same asymptotic distribution) are obtained if we condition on the first observation, seeking to maximize

$$-[(T - 1)/2]\log(2\pi) - [(T - 1)/2]\log(\sigma^2) - [1/(2\sigma^2)] \sum_{t=2}^{T} [(y_t - x_t'\beta) - \rho(y_{t-1} - x_{t-1}'\beta)]^2. \tag*{[8.3.15]}$$

If we knew the value of $\rho$, then the value of $\beta$ that maximizes [8.3.15] could be found by an OLS regression of $(y_t - \rho y_{t-1})$ on $(x_t - \rho x_{t-1})$ for $t = 2, 3, \ldots, T$ (call this regression A). Conversely, if we knew the value of $\beta$, then the value of $\rho$ that maximizes [8.3.15] would be found by an OLS regression of $(y_t - x_t'\beta)$ on $(y_{t-1} - x_{t-1}'\beta)$ for $t = 2, 3, \ldots, T$ (call this regression B). We can thus start with an initial guess for $\rho$ (often $\rho = 0$), and perform regression A to get an initial estimate of $\beta$. For $\rho = 0$, this initial estimate of $\beta$ would just be the OLS estimate $b$. This estimate of $\beta$ can be used in regression B to get an updated estimate of $\rho$, for example, by regressing the OLS residual $\hat{u}_t = y_t - x_t'b$ on its own lagged value. This new estimate of $\rho$ can be used to repeat the two regressions. Zigzagging back and forth between A and B is known as the iterated Cochrane-Orcutt method and will converge to a local maximum of [8.3.15].
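The zigzag between regressions A and B can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production routine: the function names, the convergence tolerance, and the iteration cap are all choices made for this example, and regression B is run without an intercept, as in the text.

```python
import numpy as np

def ols(y, X):
    """OLS coefficients from a regression of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def cochrane_orcutt(y, X, rho0=0.0, tol=1e-10, max_iter=500):
    """Iterate regressions A and B until the estimate of rho settles down."""
    rho = rho0
    beta = None
    for _ in range(max_iter):
        # Regression A: (y_t - rho*y_{t-1}) on (x_t - rho*x_{t-1}), t = 2..T
        beta = ols(y[1:] - rho * y[:-1], X[1:] - rho * X[:-1])
        # Regression B: residual u_t on its own lag u_{t-1}, t = 2..T
        u = y - X @ beta
        rho_new = (u[:-1] @ u[1:]) / (u[:-1] @ u[:-1])
        converged = abs(rho_new - rho) < tol
        rho = rho_new
        if converged:
            break
    return beta, rho
```

With `rho0=0.0` the first pass through regression A is ordinary OLS, so stopping after one iteration (`max_iter=1`) returns the OLS estimate $b$ together with the one-step estimate of $\rho$ discussed next.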

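Alternatively, consider the estimate of $\rho$ that results from the first iteration alone,

$$\hat{\rho} = \frac{(1/T)\sum_{t=1}^{T} \hat{u}_{t-1}\hat{u}_t}{(1/T)\sum_{t=1}^{T} \hat{u}_{t-1}^2}, \tag*{[8.3.16]}$$

where $\hat{u}_t = y_t - x_t'b$ and $b$ is the OLS estimate of $\beta$. To simplify expressions, we have renormalized the number of observations in the original sample to $T + 1$, denoted $y_0, y_1, \ldots, y_T$, so that $T$ observations are used in the conditional maximum likelihood estimation. Notice that

$$\hat{u}_t = (y_t - \beta'x_t + \beta'x_t - b'x_t) = u_t + (\beta - b)'x_t,$$

allowing the numerator of [8.3.16] to be written

$$\begin{aligned}
(1/T)\sum_{t=1}^{T} \hat{u}_t\hat{u}_{t-1}
&= (1/T)\sum_{t=1}^{T} [u_t + (\beta - b)'x_t][u_{t-1} + (\beta - b)'x_{t-1}] \\
&= (1/T)\sum_{t=1}^{T} (u_t u_{t-1}) + (\beta - b)'(1/T)\sum_{t=1}^{T} (u_t x_{t-1} + u_{t-1}x_t) \\
&\quad + (\beta - b)'\Bigl[(1/T)\sum_{t=1}^{T} x_t x_{t-1}'\Bigr](\beta - b).
\end{aligned} \tag*{[8.3.17]}$$

As long as $b$ is a consistent estimate of $\beta$ and boundedness conditions ensure that plims of $(1/T)\sum_{t=1}^{T} u_t x_{t-1}$, $(1/T)\sum_{t=1}^{T} u_{t-1}x_t$, and $(1/T)\sum_{t=1}^{T} x_t x_{t-1}'$ exist, then

$$\begin{aligned}
(1/T)\sum_{t=1}^{T} \hat{u}_t\hat{u}_{t-1}
&\xrightarrow{p} (1/T)\sum_{t=1}^{T} u_t u_{t-1} \\
&= (1/T)\sum_{t=1}^{T} (\varepsilon_t + \rho u_{t-1})u_{t-1} \\
&\xrightarrow{p} \rho \cdot \mathrm{Var}(u).
\end{aligned} \tag*{[8.3.18]}$$

Similar analysis establishes that the denominator of [8.3.16] converges in probability to $\mathrm{Var}(u)$, so that $\hat{\rho} \xrightarrow{p} \rho$.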

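If $u_t$ is uncorrelated with $x_s$ for $s = t - 1$, $t$, and $t + 1$, one can make the stronger claim that an estimate of $\rho$ based on an autoregression of the OLS residuals $\hat{u}_t$ (expression [8.3.16]) has the same asymptotic distribution as an estimate of $\rho$ based on the true population residuals $u_t$. Specifically, if $\mathrm{plim}[(1/T)\sum_{t=1}^{T} u_t x_{t-1}] = \mathrm{plim}[(1/T)\sum_{t=1}^{T} u_{t-1}x_t] = 0$, then multiplying [8.3.17] by $\sqrt{T}$, we find

$$\begin{aligned}
(1/\sqrt{T})\sum_{t=1}^{T} \hat{u}_t\hat{u}_{t-1}
&= (1/\sqrt{T})\sum_{t=1}^{T} (u_t u_{t-1}) + \sqrt{T}(\beta - b)'(1/T)\sum_{t=1}^{T} (u_t x_{t-1} + u_{t-1}x_t) \\
&\quad + \sqrt{T}(\beta - b)'\Bigl[(1/T)\sum_{t=1}^{T} x_t x_{t-1}'\Bigr](\beta - b) \\
&\xrightarrow{p} (1/\sqrt{T})\sum_{t=1}^{T} (u_t u_{t-1}) + \sqrt{T}(\beta - b)'\,0 \\
&\quad + \sqrt{T}(\beta - b)'\,\mathrm{plim}\Bigl[(1/T)\sum_{t=1}^{T} x_t x_{t-1}'\Bigr]\,0 \\
&= (1/\sqrt{T})\sum_{t=1}^{T} (u_t u_{t-1}).
\end{aligned} \tag*{[8.3.19]}$$

Hence,

$$\sqrt{T}\left[\frac{(1/T)\sum_{t=1}^{T} \hat{u}_{t-1}\hat{u}_t}{(1/T)\sum_{t=1}^{T} \hat{u}_{t-1}^2}\right] \xrightarrow{p} \sqrt{T}\left[\frac{(1/T)\sum_{t=1}^{T} u_{t-1}u_t}{(1/T)\sum_{t=1}^{T} u_{t-1}^2}\right]. \tag*{[8.3.20]}$$

The OLS estimate of $\rho$ based on the population residuals would have an asymptotic distribution given by [8.2.30]:

$$\sqrt{T}\left[\frac{(1/T)\sum_{t=1}^{T} u_{t-1}u_t}{(1/T)\sum_{t=1}^{T} u_{t-1}^2} - \rho\right] \xrightarrow{L} N\bigl(0, (1 - \rho^2)\bigr). \tag*{[8.3.21]}$$

Result [8.3.20] implies that an estimate of $\rho$ has the same asymptotic distribution when based on any consistent estimate of $\beta$. If the Cochrane-Orcutt iterations are stopped after just one evaluation of $\hat{\rho}$, the resulting estimate of $\rho$ has the same asymptotic distribution as the estimate of $\rho$ emerging from any subsequent step of the iteration.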

The same also turns out to be true of the GLS estimate $\tilde{b}$.


Proposition 8.4: Suppose that Assumption 8.5(a) and (b) holds with $V$ given by [8.3.8] and $|\rho| < 1$. Suppose in addition that $(1/T)\sum_{t=1}^{T} x_t u_s \xrightarrow{p} 0$ for all $s$ and that $(1/T)\sum_{t=1}^{T} x_t x_t'$ and $(1/T)\sum_{t=1}^{T} x_t x_{t-1}'$ have finite plims. Then the GLS estimate $\tilde{b}_T$ constructed from $V(\hat{\rho})$ for $\hat{\rho}$ given by [8.3.16] has the same asymptotic distribution as $\tilde{b}_T$ constructed from $V(\rho)$ for the true value of $\rho$.


Serial Correlation with Lagged Endogenous Variables 


An endogenous variable is a variable that is correlated with the regression error term $u_t$. Many of the preceding results about serially correlated errors no longer hold if the regression contains lagged endogenous variables. For example, consider estimation of

$$y_t = \beta y_{t-1} + \gamma x_t + u_t, \tag*{[8.3.22]}$$

where $u_t$ follows an AR(1) process as in [8.3.7]. Since (1) $u_t$ is correlated with $u_{t-1}$ and (2) $u_{t-1}$ is correlated with $y_{t-1}$, it follows that $u_t$ is correlated with the explanatory variable $y_{t-1}$. Accordingly, it is not the case that $\mathrm{plim}[(1/T)\sum_{t=1}^{T} x_t u_t] = 0$, the key condition required for consistency of the OLS estimator $b$. Hence, $\hat{\rho}$ in [8.3.16] is not a consistent estimate of $\rho$.

If one nevertheless iterates on the Cochrane-Orcutt procedure, then the algorithm will converge to a local maximum of [8.3.15]. However, the resulting GLS estimate $\tilde{b}$ need not be a consistent estimate of $\beta$. Notwithstanding, the global maximum of [8.3.15] should provide a consistent estimate of $\beta$. By experimenting with start-up values for iterated Cochrane-Orcutt other than $\rho = 0$, one should find this global maximum.¹⁰

A simple estimate of $\rho$ that is consistent in the presence of lagged endogenous variables was suggested by Durbin (1960). Multiplying [8.3.22] by $(1 - \rho L)$ gives

$$y_t = (\rho + \beta)y_{t-1} - \rho\beta y_{t-2} + \gamma x_t - \rho\gamma x_{t-1} + \varepsilon_t. \tag*{[8.3.23]}$$

This is a restricted version of the regression model

$$y_t = \alpha_1 y_{t-1} + \alpha_2 y_{t-2} + \alpha_3 x_t + \alpha_4 x_{t-1} + \varepsilon_t, \tag*{[8.3.24]}$$

where the four regression coefficients $(\alpha_1, \alpha_2, \alpha_3, \alpha_4)$ are restricted to be nonlinear functions of three underlying parameters $(\rho, \beta, \gamma)$. Minimization of the sum of squared $\varepsilon$'s in [8.3.23] is equivalent to maximum likelihood estimation conditioning on the first two observations. Moreover, the error term in equation [8.3.24] is uncorrelated with the explanatory variables, and so the $\alpha$'s can be estimated consistently by OLS estimation of [8.3.24]. Then $-\hat{\alpha}_4/\hat{\alpha}_3$ provides a consistent estimate of $\rho$ despite the presence of lagged endogenous variables in [8.3.24].
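A rough Python sketch of this two-line recipe follows, assuming scalar series `y` and `x` stored as NumPy arrays; the function name `durbin_rho` is hypothetical and chosen only for this illustration. It runs the unrestricted regression [8.3.24] by OLS and returns $-\hat{\alpha}_4/\hat{\alpha}_3$ together with the full coefficient vector.

```python
import numpy as np

def durbin_rho(y, x):
    """OLS on the unrestricted regression [8.3.24]; returns (-a4/a3, alpha_hat)."""
    dep = y[2:]                                   # y_t for t = 3..T
    regressors = np.column_stack([
        y[1:-1],                                  # y_{t-1}
        y[:-2],                                   # y_{t-2}
        x[2:],                                    # x_t
        x[1:-1],                                  # x_{t-1}
    ])
    alpha = np.linalg.lstsq(regressors, dep, rcond=None)[0]
    rho_hat = -alpha[3] / alpha[2]
    return rho_hat, alpha
```

Because the ratio divides by $\hat{\alpha}_3$, the estimate can be erratic in small samples when $\gamma$ is close to zero; that caveat is an observation about this sketch rather than a claim made in the text.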

Even if consistent estimates of $\rho$ and $\beta$ are obtained, Durbin (1970) emphasized that with lagged endogenous variables it will still not be the case that an estimate of $\rho$ based on $(y_t - x_t'\hat{\beta})$ has the same asymptotic distribution as an estimate based on $(y_t - x_t'\beta)$. To see this, note that if $x_t$ contains lagged endogenous variables, then [8.3.19] would no longer be valid. If $x_t$ includes $y_{t-1}$, for example, then $x_t$ and $u_{t-1}$ will be correlated and $\mathrm{plim}[(1/T)\sum_{t=1}^{T} u_{t-1}x_t] \neq 0$, contrary to what was assumed in arriving at [8.3.19]. Hence, [8.3.20] will not hold when $x_t$ includes lagged endogenous variables. Again, an all-purpose procedure that will work is to maximize the log likelihood function [8.3.15] numerically.


Higher-Order Serial Correlation¹¹


Consider next the case when the distribution of $u \mid X$ can be described by a $p$th-order autoregression,

$$u_t = \rho_1 u_{t-1} + \rho_2 u_{t-2} + \cdots + \rho_p u_{t-p} + \varepsilon_t.$$

¹⁰See Betancourt and Kelejian (1981).
¹¹This discussion is based on Harvey (1981, pp. 204-6).

The log likelihood conditional on $X$ for this case becomes

$$\begin{aligned}
&-(T/2)\log(2\pi) - (T/2)\log(\sigma^2) - (1/2)\log|V_p| \\
&\quad - [1/(2\sigma^2)](y_p - X_p\beta)'V_p^{-1}(y_p - X_p\beta) \\
&\quad - [1/(2\sigma^2)] \sum_{t=p+1}^{T} \bigl[(y_t - x_t'\beta) - \rho_1(y_{t-1} - x_{t-1}'\beta) \\
&\qquad\qquad - \rho_2(y_{t-2} - x_{t-2}'\beta) - \cdots - \rho_p(y_{t-p} - x_{t-p}'\beta)\bigr]^2,
\end{aligned} \tag*{[8.3.25]}$$

where the $(p \times 1)$ vector $y_p$ denotes the first $p$ observations on $y$, $X_p$ is the $(p \times k)$ matrix of explanatory variables associated with these first $p$ observations, and $\sigma^2 V_p$ is the $(p \times p)$ variance-covariance matrix of $(y_p \mid X_p)$. The row $i$, column $j$ element of $\sigma^2 V_p$ is given by $\gamma_{|i-j|}$ for $\gamma_k$ the $k$th autocovariance of an AR($p$) process with autoregressive parameters $\rho_1, \rho_2, \ldots, \rho_p$ and innovation variance $\sigma^2$. Letting $L_p$ denote a $(p \times p)$ matrix such that $L_p'L_p = V_p^{-1}$, GLS can be obtained by regressing $\tilde{y}_p = L_p y_p$ on $\tilde{X}_p = L_p X_p$ and $\tilde{y}_t = y_t - \rho_1 y_{t-1} - \rho_2 y_{t-2} - \cdots - \rho_p y_{t-p}$ on $\tilde{x}_t = x_t - \rho_1 x_{t-1} - \rho_2 x_{t-2} - \cdots - \rho_p x_{t-p}$ for $t = p + 1, p + 2, \ldots, T$. Equation [8.3.14] is a special case of [8.3.25] with $p = 1$, $V_p = 1/(1 - \rho^2)$, and $L_p = \sqrt{1 - \rho^2}$.

If we are willing to condition on the first $p$ observations, the task is to choose $\beta$ and $\rho_1, \rho_2, \ldots, \rho_p$ so as to minimize

$$\sum_{t=p+1}^{T} \bigl[(y_t - x_t'\beta) - \rho_1(y_{t-1} - x_{t-1}'\beta) - \rho_2(y_{t-2} - x_{t-2}'\beta) - \cdots - \rho_p(y_{t-p} - x_{t-p}'\beta)\bigr]^2.$$

Again, in the absence of lagged endogenous variables we can iterate as in Cochrane-Orcutt, first taking the $\rho_i$'s as given and regressing $\tilde{y}_t$ on $\tilde{x}_t$, and then taking $\beta$ as given and regressing $\hat{u}_t$ on $\hat{u}_{t-1}, \hat{u}_{t-2}, \ldots, \hat{u}_{t-p}$.

Any covariance-stationary process for the errors can always be approximated by a finite autoregression, provided that the order of the approximating autoregression ($p$) is sufficiently large. Amemiya (1973) demonstrated that by letting $p$ go to infinity at a slower rate than the sample size $T$, this iterated GLS estimate will have the same asymptotic distribution as would the GLS estimate for the case when $V$ is known. Alternatively, if theory implies an ARMA($p$, $q$) structure for the errors with $p$ and $q$ known, one can find exact or approximate maximum likelihood estimates by adapting the methods in Chapter 5, replacing $\mu$ in the expressions in Chapter 5 with $x_t'\beta$.


Further Remarks on Heteroskedasticity 


Heteroskedasticity can arise from a variety of sources, and the solution de- 
pends on the nature of the problem identified. Using logs rather than levels of 
variables, allowing the explanatory variables to enter nonlinearly in the regression 
equation, or adding previously omitted explanatory variables to the regression may 
all be helpful. Judge, Griffiths, Hill, and Lee (1980) discussed a variety of solutions 
when the heteroskedasticity is thought to be related to the explanatory variables. 
In time series regressions, the explanatory variables themselves exhibit dynamic 
behavior, and such specifications then imply a dynamic structure for the conditional 
variance. An example of such a model is the autoregressive conditional hetero- 
skedasticity specification of Engle (1982). Dynamic models of heteroskedasticity 
will be discussed in Chapter 21. 




APPENDIX 8.A. Proofs of Chapter 8 Propositions 


■ Proof of Proposition 8.2. The restricted estimate $b^*$ that minimizes [8.1.2] subject to [8.1.27] can be calculated using the Lagrangean:

$$J = (1/2)\sum_{t=1}^{T} (y_t - x_t'\beta)^2 + \lambda'(R\beta - r). \tag*{[8.A.1]}$$

Here $\lambda$ denotes an $(m \times 1)$ vector of Lagrange multipliers; $\lambda_i$ is associated with the constraint represented by the $i$th row of $R\beta = r$. The term $\tfrac{1}{2}$ is a normalizing constant to simplify the expressions that follow. The constrained minimum is found by setting the derivative of [8.A.1] with respect to $\beta$ equal to zero:¹²

$$\frac{\partial J}{\partial \beta'} = (1/2)\sum_{t=1}^{T} 2(y_t - x_t'\beta)\,\frac{\partial (y_t - x_t'\beta)}{\partial \beta'} + \lambda'R = -\sum_{t=1}^{T} (y_t - \beta'x_t)x_t' + \lambda'R = 0,$$

or

$$b^{*\prime}\sum_{t=1}^{T} x_t x_t' = \sum_{t=1}^{T} y_t x_t' - \lambda'R.$$

Taking transposes,

$$\Bigl[\sum_{t=1}^{T} x_t x_t'\Bigr] b^* = \sum_{t=1}^{T} x_t y_t - R'\lambda$$

$$b^* = \Bigl[\sum_{t=1}^{T} x_t x_t'\Bigr]^{-1}\Bigl[\sum_{t=1}^{T} x_t y_t\Bigr] - \Bigl[\sum_{t=1}^{T} x_t x_t'\Bigr]^{-1} R'\lambda = b - (X'X)^{-1}R'\lambda, \tag*{[8.A.2]}$$

where $b$ denotes the unrestricted OLS estimate. Premultiplying [8.A.2] by $R$ (and recalling that $b^*$ satisfies $Rb^* = r$),

$$Rb - r = R(X'X)^{-1}R'\lambda$$

or

$$\lambda = [R(X'X)^{-1}R']^{-1}(Rb - r). \tag*{[8.A.3]}$$

Substituting [8.A.3] into [8.A.2],

$$b - b^* = (X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}(Rb - r). \tag*{[8.A.4]}$$

Notice from [8.A.4] that

$$\begin{aligned}
(b - b^*)'(X'X)(b - b^*)
&= \{(Rb - r)'[R(X'X)^{-1}R']^{-1}R(X'X)^{-1}\}(X'X) \\
&\quad \times \{(X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}(Rb - r)\} \\
&= (Rb - r)'[R(X'X)^{-1}R']^{-1}[R(X'X)^{-1}R'] \\
&\quad \times [R(X'X)^{-1}R']^{-1}(Rb - r) \\
&= (Rb - r)'[R(X'X)^{-1}R']^{-1}(Rb - r).
\end{aligned} \tag*{[8.A.5]}$$

Thus, the magnitude in [8.1.32] is numerically identical to

$$F = \frac{(b - b^*)'X'X(b - b^*)}{s_T^2\, m} = \frac{(b - b^*)'X'X(b - b^*)/m}{RSS_1/(T - k)}.$$

Comparing this with [8.1.37], we will have completed the demonstration of the equivalence of [8.1.32] with [8.1.37] if it is the case that

$$RSS_0 - RSS_1 = (b - b^*)'(X'X)(b - b^*). \tag*{[8.A.6]}$$

¹²We have used the fact that $\partial x_t'\beta/\partial \beta' = x_t'$. See the Mathematical Review (Appendix A) at the end of the book on the use of derivatives with respect to vectors.

Now, notice that

$$\begin{aligned}
RSS_0 &= (y - Xb^*)'(y - Xb^*) \\
&= (y - Xb + Xb - Xb^*)'(y - Xb + Xb - Xb^*) \\
&= (y - Xb)'(y - Xb) + (b - b^*)'X'X(b - b^*),
\end{aligned} \tag*{[8.A.7]}$$

where the cross-product term has vanished, since $(y - Xb)'X = 0$ by the least squares property [8.1.10]. Equation [8.A.7] states that

$$RSS_0 = RSS_1 + (b - b^*)'X'X(b - b^*), \tag*{[8.A.8]}$$

confirming [8.A.6]. ■
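The algebra of this proof can be checked numerically. The Python sketch below is only an illustration: the function name `restricted_ols` and the use of NumPy are assumptions of this example. It computes $b$, the multipliers $\lambda$ from [8.A.3], and $b^*$ from [8.A.2], so that the identities $Rb^* = r$ and [8.A.6] can be verified on any data set.

```python
import numpy as np

def restricted_ols(y, X, R, r):
    """Return (b, b_star, lam): unrestricted OLS, restricted OLS, multipliers."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ (X.T @ y)                                # unrestricted OLS
    lam = np.linalg.solve(R @ XtX_inv @ R.T, R @ b - r)    # [8.A.3]
    b_star = b - XtX_inv @ R.T @ lam                       # [8.A.2]
    return b, b_star, lam

def rss(y, X, coef):
    """Residual sum of squares at a given coefficient vector."""
    resid = y - X @ coef
    return resid @ resid
```

For any full-column-rank `X` and full-row-rank `R`, `rss(y, X, b_star) - rss(y, X, b)` should equal `(b - b_star) @ X.T @ X @ (b - b_star)` up to rounding, which is the content of [8.A.6].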


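■ Proof of Proposition 8.3. Assumption 8.6(e) guarantees that $\hat{Q}_T \xrightarrow{p} Q$, so the issue is whether $\hat{\Omega}_T$ gives a consistent estimate of $\Omega$. Define $\Omega_T^* \equiv (1/T)\sum_{t=1}^{T} u_t^2 x_t x_t'$, noting that $\Omega_T^*$ converges in probability to $\Omega$ by Assumption 8.6(c). Thus, if we can show that $\hat{\Omega}_T - \Omega_T^* \xrightarrow{p} 0$, then $\hat{\Omega}_T \xrightarrow{p} \Omega$. Now,

$$\hat{\Omega}_T - \Omega_T^* = (1/T)\sum_{t=1}^{T} (\hat{u}_t^2 - u_t^2)x_t x_t'. \tag*{[8.A.9]}$$

But

$$\begin{aligned}
(\hat{u}_t^2 - u_t^2) &= (\hat{u}_t + u_t)(\hat{u}_t - u_t) \\
&= [(y_t - b_T'x_t) + (y_t - \beta'x_t)][(y_t - b_T'x_t) - (y_t - \beta'x_t)] \\
&= [2(y_t - \beta'x_t) - (b_T - \beta)'x_t][-(b_T - \beta)'x_t] \\
&= -2u_t(b_T - \beta)'x_t + [(b_T - \beta)'x_t]^2,
\end{aligned}$$

allowing [8.A.9] to be written as

$$\hat{\Omega}_T - \Omega_T^* = (-2/T)\sum_{t=1}^{T} u_t(b_T - \beta)'x_t(x_t x_t') + (1/T)\sum_{t=1}^{T} [(b_T - \beta)'x_t]^2(x_t x_t'). \tag*{[8.A.10]}$$

The first term in [8.A.10] can be written

$$(-2/T)\sum_{t=1}^{T} u_t(b_T - \beta)'x_t(x_t x_t') = -2\sum_{i=1}^{k} (b_{iT} - \beta_i)\Bigl[(1/T)\sum_{t=1}^{T} u_t x_{it}(x_t x_t')\Bigr]. \tag*{[8.A.11]}$$

The second term in [8.A.11] has a finite plim by Assumption 8.6(e), and $(b_{iT} - \beta_i) \xrightarrow{p} 0$ for each $i$. Hence, the probability limit of [8.A.11] is zero.

Turning next to the second term in [8.A.10],

$$(1/T)\sum_{t=1}^{T} [(b_T - \beta)'x_t]^2(x_t x_t') = \sum_{i=1}^{k}\sum_{j=1}^{k} (b_{iT} - \beta_i)(b_{jT} - \beta_j)\Bigl[(1/T)\sum_{t=1}^{T} x_{it}x_{jt}(x_t x_t')\Bigr],$$

which again has plim zero. Hence, from [8.A.10],

$$\hat{\Omega}_T - \Omega_T^* \xrightarrow{p} 0. \quad ■$$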


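■ Proof of Proposition 8.4. Recall from [8.2.6] that

$$\begin{aligned}
\sqrt{T}(\tilde{b}_T - \beta) &= \Bigl[(1/T)\sum_{t=1}^{T} \tilde{x}_t\tilde{x}_t'\Bigr]^{-1}\Bigl[(1/\sqrt{T})\sum_{t=1}^{T} \tilde{x}_t\tilde{u}_t\Bigr] \\
&= \Bigl[(1/T)\sum_{t=1}^{T} (x_t - \hat{\rho}x_{t-1})(x_t - \hat{\rho}x_{t-1})'\Bigr]^{-1} \\
&\quad \times \Bigl[(1/\sqrt{T})\sum_{t=1}^{T} (x_t - \hat{\rho}x_{t-1})(u_t - \hat{\rho}u_{t-1})\Bigr].
\end{aligned} \tag*{[8.A.12]}$$

We will now show that $[(1/T)\sum_{t=1}^{T} (x_t - \hat{\rho}x_{t-1})(x_t - \hat{\rho}x_{t-1})']$ has the same plim as $[(1/T)\sum_{t=1}^{T} (x_t - \rho x_{t-1})(x_t - \rho x_{t-1})']$ and that $[(1/\sqrt{T})\sum_{t=1}^{T} (x_t - \hat{\rho}x_{t-1})(u_t - \hat{\rho}u_{t-1})]$ has the same asymptotic distribution as $[(1/\sqrt{T})\sum_{t=1}^{T} (x_t - \rho x_{t-1})(u_t - \rho u_{t-1})]$.

Consider the first term in [8.A.12]:

$$\begin{aligned}
&(1/T)\sum_{t=1}^{T} (x_t - \hat{\rho}x_{t-1})(x_t - \hat{\rho}x_{t-1})' \\
&\quad = (1/T)\sum_{t=1}^{T} [x_t - \rho x_{t-1} + (\rho - \hat{\rho})x_{t-1}][x_t - \rho x_{t-1} + (\rho - \hat{\rho})x_{t-1}]' \\
&\quad = (1/T)\sum_{t=1}^{T} (x_t - \rho x_{t-1})(x_t - \rho x_{t-1})' \\
&\qquad + (\rho - \hat{\rho})\cdot(1/T)\sum_{t=1}^{T} (x_t - \rho x_{t-1})x_{t-1}' \\
&\qquad + (\rho - \hat{\rho})\cdot(1/T)\sum_{t=1}^{T} x_{t-1}(x_t - \rho x_{t-1})' \\
&\qquad + (\rho - \hat{\rho})^2\cdot(1/T)\sum_{t=1}^{T} x_{t-1}x_{t-1}'.
\end{aligned} \tag*{[8.A.13]}$$

But $(\rho - \hat{\rho}) \xrightarrow{p} 0$, and the plims of $(1/T)\sum_{t=1}^{T} x_{t-1}x_{t-1}'$ and $(1/T)\sum_{t=1}^{T} x_t x_{t-1}'$ are assumed to exist. Hence [8.A.13] has the same plim as $(1/T)\sum_{t=1}^{T} (x_t - \rho x_{t-1})(x_t - \rho x_{t-1})'$.

Consider next the second term in [8.A.12]:

$$\begin{aligned}
&(1/\sqrt{T})\sum_{t=1}^{T} (x_t - \hat{\rho}x_{t-1})(u_t - \hat{\rho}u_{t-1}) \\
&\quad = (1/\sqrt{T})\sum_{t=1}^{T} [x_t - \rho x_{t-1} + (\rho - \hat{\rho})x_{t-1}][u_t - \rho u_{t-1} + (\rho - \hat{\rho})u_{t-1}] \\
&\quad = (1/\sqrt{T})\sum_{t=1}^{T} (x_t - \rho x_{t-1})(u_t - \rho u_{t-1}) \\
&\qquad + \sqrt{T}(\rho - \hat{\rho})\cdot\Bigl[(1/T)\sum_{t=1}^{T} x_{t-1}(u_t - \rho u_{t-1})\Bigr] \\
&\qquad + \sqrt{T}(\rho - \hat{\rho})\cdot\Bigl[(1/T)\sum_{t=1}^{T} (x_t - \rho x_{t-1})u_{t-1}\Bigr] \\
&\qquad + \sqrt{T}(\rho - \hat{\rho})^2\cdot\Bigl[(1/T)\sum_{t=1}^{T} x_{t-1}u_{t-1}\Bigr].
\end{aligned} \tag*{[8.A.14]}$$

But [8.3.21] established that $\sqrt{T}(\rho - \hat{\rho})$ converges in distribution to a stable random variable. Since $\mathrm{plim}[(1/T)\sum_{t=1}^{T} x_t u_s] = 0$, the last three terms in [8.A.14] vanish asymptotically. Hence,

$$(1/\sqrt{T})\sum_{t=1}^{T} (x_t - \hat{\rho}x_{t-1})(u_t - \hat{\rho}u_{t-1}) \xrightarrow{p} (1/\sqrt{T})\sum_{t=1}^{T} (x_t - \rho x_{t-1})(u_t - \rho u_{t-1}),$$

which was to be shown. ■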


Chapter 8 Exercises 


8.1. Show that the uncentered $R_u^2$ [8.1.13] can equivalently be written as

$$R_u^2 = 1 - \Bigl[\Bigl(\sum_{t=1}^{T} \hat{u}_t^2\Bigr) \div \Bigl(\sum_{t=1}^{T} y_t^2\Bigr)\Bigr]$$

for $\hat{u}_t$ the OLS sample residual [8.1.4]. Show that the centered $R_c^2$ can be written as

$$R_c^2 = 1 - \Bigl[\Bigl(\sum_{t=1}^{T} \hat{u}_t^2\Bigr) \div \Bigl(\sum_{t=1}^{T} y_t^2 - T\bar{y}^2\Bigr)\Bigr].$$

8.2. Consider a null hypothesis $H_0$ involving $m = 2$ linear restrictions on $\beta$. How large a sample size $T$ is needed before the 5% critical value based on the Wald form of the OLS $F$ test of $H_0$ is within 1% of the critical value of the Wald form of the OLS $\chi^2$ test of $H_0$?

8.3. Derive result [8.2.28].

8.4. Consider a covariance-stationary process given by

$$y_t = \mu + \sum_{j=0}^{\infty} \psi_j\varepsilon_{t-j},$$

where $\{\varepsilon_t\}$ is an i.i.d. sequence with mean zero, variance $\sigma^2$, and finite fourth moment and where $\sum_{j=0}^{\infty} |\psi_j| < \infty$. Consider estimating a $p$th-order autoregression by OLS:

$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + u_t.$$

Show that the OLS coefficients give consistent estimates of the population parameters that characterize the linear projection of $y_t$ on a constant and $p$ of its lags; that is, the coefficients give consistent estimates of the parameters $c, \phi_1, \ldots, \phi_p$ defined by

$$\hat{E}(y_t \mid y_{t-1}, y_{t-2}, \ldots, y_{t-p}) = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p}$$

(HINT: Recall that $c, \phi_1, \ldots, \phi_p$ are characterized by equation [4.3.6]).

Chapter 8 References 


Amemiya, Takeshi. 1973. “Generalized Least Squares with an Estimated Autocovariance 
Matrix.” Econometrica 41:723-32. 

Anderson, T. W. 1971. The Statistical Analysis of Time Series. New York: Wiley.

Betancourt, Roger, and Harry Kelejian. 1981. “Lagged Endogenous Variables and the Cochrane-Orcutt Procedure.” Econometrica 49:1073-78.

Brillinger, David R. 1981. Time Series: Data Analysis and Theory, expanded ed. San Francisco: Holden-Day.

Durbin, James. 1960. “Estimation of Parameters in Time-Series Regression Models.” Jour- 
nal of the Royal Statistical Society Series B, 22:139-53. 

Durbin, James. 1970. “Testing for Serial Correlation in Least-Squares Regression When Some of the Regressors Are Lagged Dependent Variables.” Econometrica 38:410-21.

Eicker, F. 1967. “Limit Theorems for Regressions with Unequal and Dependent Errors.” 
Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 
Vol. 1, pp. 59-62. Berkeley: University of California Press. 

Engle, Robert F. 1982. “Autoregressive Conditional Heteroscedasticity with Estimates of 
the Variance of United Kingdom Inflation.” Econometrica 50:987-1007. 

Evans, G. B. A., and N. E. Savin. 1981. “Testing for Unit Roots: 1.” Econometrica 49:753-79.

Flavin, Marjorie A. 1983. “Excess Volatility in the Financial Markets: A Reassessment of 
the Empirical Evidence.” Journal of Political Economy 91:929-56. 

Gregory, Allan W., and Michael R. Veall. 1985. “Formulating Wald Tests of Nonlinear 
Restrictions.” Econometrica 53:1465—68. 

Hansen, Lars P. 1982. “Large Sample Properties of Generalized Method of Moments Estimators.” Econometrica 50:1029-54.

Harvey, A. C. 1981. The Econometric Analysis of Time Series. New York: Wiley.

Hausman, Jerry A., and William E. Taylor. 1983. “Identification in Linear Simultaneous Equations Models with Covariance Restrictions: An Instrumental Variables Interpretation.” Econometrica 51:1527-49.

Imhof, J. P. 1961. “Computing the Distribution of Quadratic Forms in Normal Variables.” 
Biometrika 48:419-26. 

Judge, George G., William E. Griffiths, R. Carter Hill, and Tsoung-Chao Lee. 1980. The 
Theory and Practice of Econometrics. New York: Wiley. 

Kinderman, A. J., and J. G. Ramage. 1976. “Computer Generation of Normal Random 
Variables.” Journal of the American Statistical Association 71:893-96. 

Lafontaine, Francine, and Kenneth J. White. 1986. “Obtaining Any Wald Statistic You 
Want.” Economics Letters 21:35-40. 




Maddala, G. S. 1977. Econometrics. New York: McGraw-Hill. 


Newey, Whitney K., and Kenneth D. West. 1987. “A Simple Positive Semi-Definite, Heter- 
oskedasticity and Autocorrelation Consistent Covariance Matrix.” Econometrica 55:703-8. 


Nicholls, D. F., and A. R. Pagan. 1983. “Heteroscedasticity in Models with Lagged De- 
pendent Variables.” Econometrica 51:1233-42. 


O’Nan, Michael. 1976. Linear Algebra, 2d ed. New York: Harcourt Brace Jovanovich. 


Phillips, P. C. B., and Joon Y. Park. 1988. “On the Formulation of Wald Tests of Nonlinear Restrictions.” Econometrica 56:1065-83.


Rao, C. Radhakrishna. 1973. Linear Statistical Inference and Its Applications, 2d ed. New 
York: Wiley. 


Theil, Henri. 1971. Principles of Econometrics. New York: Wiley. 


White, Halbert. 1980. “A Heteroskedasticity-Consistent Covariance Matrix Estimator and 
a Direct Test for Heteroskedasticity.” Econometrica 48:817-38. 


White, Halbert. 1984. Asymptotic Theory for Econometricians. Orlando, Fla.: Academic Press.




Linear Systems 
of Simultaneous Equations 


The previous chapter described a number of possible departures from the ideal regression model arising from errors that are non-Gaussian, heteroskedastic, or autocorrelated. We saw that while these factors can make a difference for the small-sample validity of $t$ and $F$ tests, under any of Assumptions 8.1 through 8.6, the OLS estimator $b_T$ is either unbiased or consistent. This is because all these cases retained the crucial assumption that $u_t$, the error term for observation $t$, is uncorrelated with $x_t$, the explanatory variables for that observation. Unfortunately, this critical assumption is unlikely to be satisfied in many important applications.

Section 9.1 discusses why this assumption often fails to hold, by examining a 
concrete example of simultaneous equations bias. Subsequent sections discuss a 
variety of techniques for dealing with this problem. These results will be used in 
the structural interpretation of vector autoregressions in Chapter 11 and for under- 
standing generalized method of moments estimation in Chapter 14. 


9.1. Simultaneous Equations Bias 


To illustrate the difficulties with endogenous regressors, consider an investigation of the public's demand for oranges. Let $p_t$ denote the log of the price of oranges in a particular year and $q_t^d$ the log of the quantity the public is willing to buy. To keep the example very simple, suppose that price and quantity are covariance-stationary and that each is measured as deviations from its population mean. The demand curve is presumed to take the form

$$q_t^d = \beta p_t + \varepsilon_t^d, \tag*{[9.1.1]}$$

with $\beta < 0$; a higher price reduces the quantity that the public is willing to buy. Here $\varepsilon_t^d$ represents factors that influence demand other than price. These are assumed to be independent and identically distributed with mean zero and variance $\sigma_d^2$.

The price also influences the supply of oranges brought to the market,

$$q_t^s = \gamma p_t + \varepsilon_t^s, \tag*{[9.1.2]}$$

where $\gamma > 0$ and $\varepsilon_t^s$ represents factors that influence supply other than price. These omitted factors are again assumed to be i.i.d. with mean zero and variance $\sigma_s^2$, with the supply disturbance $\varepsilon_t^s$ uncorrelated with the demand disturbance $\varepsilon_t^d$.

Equation [9.1.1] describes the behavior of buyers of oranges, and equation [9.1.2] describes the behavior of sellers. Market equilibrium requires $q_t^d = q_t^s$, or

$$\beta p_t + \varepsilon_t^d = \gamma p_t + \varepsilon_t^s.$$


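Rearranging,

$$p_t = \frac{\varepsilon_t^d - \varepsilon_t^s}{\gamma - \beta}. \tag*{[9.1.3]}$$

Substituting this back into [9.1.2],

$$q_t = \gamma\,\frac{\varepsilon_t^d - \varepsilon_t^s}{\gamma - \beta} + \varepsilon_t^s = \frac{\gamma}{\gamma - \beta}\,\varepsilon_t^d - \frac{\beta}{\gamma - \beta}\,\varepsilon_t^s. \tag*{[9.1.4]}$$

Consider the consequences of trying to estimate [9.1.1] by OLS. A regression of quantity on price will produce the estimate

$$b_T = \frac{(1/T)\sum_{t=1}^{T} p_t q_t}{(1/T)\sum_{t=1}^{T} p_t^2}. \tag*{[9.1.5]}$$

Substituting [9.1.3] and [9.1.4] into the numerator in [9.1.5] results in

$$\begin{aligned}
\frac{1}{T}\sum_{t=1}^{T} p_t q_t
&= \frac{1}{T}\sum_{t=1}^{T} \Bigl[\frac{1}{\gamma - \beta}\,\varepsilon_t^d - \frac{1}{\gamma - \beta}\,\varepsilon_t^s\Bigr]\Bigl[\frac{\gamma}{\gamma - \beta}\,\varepsilon_t^d - \frac{\beta}{\gamma - \beta}\,\varepsilon_t^s\Bigr] \\
&= \frac{1}{T}\sum_{t=1}^{T} \Bigl[\frac{\gamma}{(\gamma - \beta)^2}(\varepsilon_t^d)^2 + \frac{\beta}{(\gamma - \beta)^2}(\varepsilon_t^s)^2 - \frac{\gamma + \beta}{(\gamma - \beta)^2}\,\varepsilon_t^d\varepsilon_t^s\Bigr] \\
&\xrightarrow{p} \frac{\gamma\sigma_d^2 + \beta\sigma_s^2}{(\gamma - \beta)^2}.
\end{aligned}$$

Similarly, for the denominator,

$$\frac{1}{T}\sum_{t=1}^{T} p_t^2 = \frac{1}{T}\sum_{t=1}^{T} \Bigl[\frac{1}{\gamma - \beta}\,\varepsilon_t^d - \frac{1}{\gamma - \beta}\,\varepsilon_t^s\Bigr]^2 \xrightarrow{p} \frac{\sigma_d^2 + \sigma_s^2}{(\gamma - \beta)^2}.$$

Hence,

$$b_T \xrightarrow{p} \Bigl[\frac{\sigma_d^2 + \sigma_s^2}{(\gamma - \beta)^2}\Bigr]^{-1}\Bigl[\frac{\gamma\sigma_d^2 + \beta\sigma_s^2}{(\gamma - \beta)^2}\Bigr] = \frac{\gamma\sigma_d^2 + \beta\sigma_s^2}{\sigma_d^2 + \sigma_s^2}. \tag*{[9.1.6]}$$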

OLS regression thus gives not the demand elasticity $\beta$ but rather an average of $\beta$ and the supply elasticity $\gamma$, with weights depending on the sizes of the variances $\sigma_d^2$ and $\sigma_s^2$. If the error in the demand curve is negligible ($\sigma_d^2 \to 0$) or if the error term in the supply curve has a big enough variance ($\sigma_s^2 \to \infty$), then [9.1.6] indicates that OLS would give a consistent estimate of the demand elasticity $\beta$. On the other hand, if $\sigma_d^2 \to \infty$ or $\sigma_s^2 \to 0$, then OLS gives a consistent estimate of the supply elasticity $\gamma$. In the cases in between, one economist might believe the regression was estimating the demand curve [9.1.1] and a second economist might perform the same regression calling it the supply curve [9.1.2]. The actual OLS estimates would represent a mixture of both. This phenomenon is known as simultaneous equations bias.
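A small simulation can make the weighted-average result [9.1.6] concrete. The Python sketch below is an illustration under assumptions not in the text: the disturbances are drawn as Gaussian, and the function names and parameter values are hypothetical. It generates equilibrium prices and quantities from [9.1.3] and [9.1.4], computes the OLS slope [9.1.5], and evaluates the probability limit in [9.1.6] for comparison.

```python
import numpy as np

def simulate_market(T, beta, gamma, sigma_d, sigma_s, seed=0):
    """Draw equilibrium (p_t, q_t) from [9.1.3] and [9.1.4] with Gaussian shocks."""
    rng = np.random.default_rng(seed)
    eps_d = sigma_d * rng.standard_normal(T)     # demand disturbance
    eps_s = sigma_s * rng.standard_normal(T)     # supply disturbance
    p = (eps_d - eps_s) / (gamma - beta)                      # [9.1.3]
    q = (gamma * eps_d - beta * eps_s) / (gamma - beta)       # [9.1.4]
    return p, q

def ols_slope(p, q):
    """OLS coefficient from regressing q on p with no intercept, as in [9.1.5]."""
    return (p @ q) / (p @ p)

def ols_plim(beta, gamma, sigma_d, sigma_s):
    """Probability limit of the OLS slope given by [9.1.6]."""
    return (gamma * sigma_d**2 + beta * sigma_s**2) / (sigma_d**2 + sigma_s**2)
```

For instance, with `beta=-1.0`, `gamma=0.5`, `sigma_d=1.0`, and `sigma_s=2.0`, the formula [9.1.6] gives $(0.5 \cdot 1 - 1 \cdot 4)/5 = -0.7$, so a large simulated sample should produce an OLS slope near $-0.7$ rather than the demand elasticity $-1$.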

Figure 9.1 depicts the problem graphically.¹ At any date in the sample, there is some demand curve (determined by the value of $\varepsilon_t^d$) and a supply curve (determined by $\varepsilon_t^s$), with the observation on $(p_t, q_t)$ given by the intersection of these two curves. For example, date 1 may have been associated with a small negative shock to demand, producing the curve $D_1$, and a large positive shock to supply, producing $S_1$. The date 1 observation will then be $(p_1, q_1)$. Date 2 might have seen a bigger negative shock to demand and a negative shock to supply, while date 3 as drawn reflects a modest positive shock to demand and a large negative shock to supply. OLS tries to fit a line through the scatter of points $\{p_t, q_t\}_{t=1}^{T}$.

¹Economists usually display these figures with the axes reversed from those displayed in Figure 9.1.

FIGURE 9.1 Observations on price and quantity implied by disturbances to both supply functions and demand functions.

If the shocks are known to be due to the supply curve and not the demand curve, then the scatter of points will trace out the demand curve, as in Figure 9.2. If the shocks are due to the demand curve rather than the supply curve, the scatter will trace out the supply curve, as in Figure 9.3.

The problem of simultaneous equations bias is extremely widespread in the 
social sciences. It is rare that the relation that we would like to estimate is the only 
possible reason why there might be a correlation among a group of variables. 


Consistent Estimation of the Demand Elasticity 


The above analysis suggests that consistent estimates of the demand elasticity might be obtained if we could find a variable that shifts the supply curve but not the demand curve. For example, let $w_t$ represent the number of days of below-freezing temperatures in Florida during year $t$. Recalling that the supply disturbance $\varepsilon_t^s$ was defined as factors influencing supply other than price, $w_t$ seems likely to be an important component of $\varepsilon_t^s$. Define $h$ to be the coefficient from a linear projection of $\varepsilon_t^s$ on $w_t$, and write

$$\varepsilon_t^s = h w_t + u_t^s. \tag*{[9.1.7]}$$

Thus, $u_t^s$ is uncorrelated with $w_t$, by the definition of $h$. Although Florida weather is likely to influence the supply of oranges, it is natural to assume that weather matters for the public's demand for oranges only through its effect on the price. Under this assumption, both $w_t$ and $u_t^s$ are uncorrelated with $\varepsilon_t^d$. Changes in price that can be attributed to the weather represent supply shifts and not demand shifts.

FIGURE 9.2 Observations on price and quantity implied by disturbances to supply function only.

FIGURE 9.3 Observations on price and quantity implied by disturbances to demand function only.

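Define $p_t^*$ to be the linear projection of $p_t$ on $w_t$. Substituting [9.1.7] into [9.1.3],

$$p_t = \frac{\varepsilon_t^d - h w_t - u_t^s}{\gamma - \beta}, \tag*{[9.1.8]}$$

and thus

$$p_t^* = \frac{-h}{\gamma - \beta}\,w_t, \tag*{[9.1.9]}$$

since $\varepsilon_t^d$ and $u_t^s$ are uncorrelated with $w_t$. Equation [9.1.8] can thus be written

$$p_t = p_t^* + \frac{\varepsilon_t^d - u_t^s}{\gamma - \beta},$$

and substituting this into [9.1.1],

$$q_t = \beta\Bigl\{p_t^* + \frac{\varepsilon_t^d - u_t^s}{\gamma - \beta}\Bigr\} + \varepsilon_t^d = \beta p_t^* + v_t, \tag*{[9.1.10]}$$

where

$$v_t = \frac{\gamma}{\gamma - \beta}\,\varepsilon_t^d - \frac{\beta}{\gamma - \beta}\,u_t^s.$$

Since $u_t^s$ and $\varepsilon_t^d$ are both uncorrelated with $w_t$, it follows that $v_t$ is uncorrelated with $p_t^*$. Hence, if [9.1.10] were estimated by ordinary least squares, the result would be a consistent estimate of $\beta$:

$$\begin{aligned}
\hat{\beta}_T^* &= \frac{(1/T)\sum_{t=1}^{T} p_t^* q_t}{(1/T)\sum_{t=1}^{T} (p_t^*)^2} \\
&= \frac{(1/T)\sum_{t=1}^{T} p_t^*(\beta p_t^* + v_t)}{(1/T)\sum_{t=1}^{T} (p_t^*)^2} \\
&= \beta + \frac{(1/T)\sum_{t=1}^{T} p_t^* v_t}{(1/T)\sum_{t=1}^{T} (p_t^*)^2} \\
&\xrightarrow{p} \beta.
\end{aligned} \tag*{[9.1.11]}$$

The suggestion is thus to regress quantity on that component of price that is induced by the weather, that is, regress quantity on the linear projection of price on the weather.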

In practice, we will not know the values of the population parameters $h$, $\gamma$, and $\beta$ necessary to construct $p_t^*$ in [9.1.9]. However, the linear projection $p_t^*$ can be consistently estimated by the fitted value for observation $t$ from an OLS regression of $p$ on $w$,

$$\hat{p}_t = \hat{\delta}_T w_t, \tag*{[9.1.12]}$$

where

$$\hat{\delta}_T = \frac{(1/T)\sum_{t=1}^{T} w_t p_t}{(1/T)\sum_{t=1}^{T} w_t^2}.$$

The estimator [9.1.11] with $p_t^*$ replaced by $\hat{p}_t$ is known as the two-stage least squares (2SLS) coefficient estimator:

$$\hat{\beta}_{2SLS} = \frac{(1/T)\sum_{t=1}^{T} \hat{p}_t q_t}{(1/T)\sum_{t=1}^{T} (\hat{p}_t)^2}. \tag*{[9.1.13]}$$

Like $\hat{\beta}_T^*$, the 2SLS estimator is consistent, as will be shown in the following section.
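The two stages [9.1.12] and [9.1.13] can be sketched in Python as follows. This is an illustrative simulation only: the Gaussian draws, the function names, and the parameter values in the usage note are assumptions of this example rather than anything specified in the text.

```python
import numpy as np

def simulate_with_weather(T, beta, gamma, h, seed=0):
    """Simulate (w_t, p_t, q_t) with supply shock eps_s = h*w + u_s, as in [9.1.7]."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(T)          # weather instrument
    eps_d = rng.standard_normal(T)      # demand disturbance
    u_s = rng.standard_normal(T)        # part of supply shock unrelated to weather
    eps_s = h * w + u_s
    p = (eps_d - eps_s) / (gamma - beta)         # equilibrium price [9.1.3]
    q = beta * p + eps_d                         # demand curve [9.1.1]
    return w, p, q

def tsls_scalar(w, p, q):
    """Scalar 2SLS: first-stage fit [9.1.12], then second-stage slope [9.1.13]."""
    delta_hat = (w @ p) / (w @ w)
    p_hat = delta_hat * w
    return (p_hat @ q) / (p_hat @ p_hat)

def ols_scalar(p, q):
    """OLS slope of q on p, for comparison with the 2SLS estimate."""
    return (p @ q) / (p @ p)
```

With, say, `beta=-1.0`, `gamma=0.5`, and `h=-2.0`, a large sample should give a 2SLS estimate close to $-1$, while the OLS slope stays visibly away from it, in line with [9.1.6].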


9.2. Instrumental Variables and Two-Stage Least Squares 


General Description of Two-Stage Least Squares 


A generalization of the previous example is as follows. Suppose the objective is to estimate the vector $\beta$ in the regression model

$$y_t = \beta'z_t + u_t, \tag*{[9.2.1]}$$

where $z_t$ is a $(k \times 1)$ vector of explanatory variables. Some subset $n \le k$ of the variables in $z_t$ are thought to be endogenous, that is, correlated with $u_t$. The remaining $k - n$ variables in $z_t$ are said to be predetermined, meaning that they are uncorrelated with $u_t$. Estimation of $\beta$ requires variables known as instruments. To be a valid instrument, a variable must be correlated with an endogenous explanatory variable in $z_t$ but uncorrelated with the regression disturbance $u_t$. In the supply-and-demand example, the weather variable $w_t$ served as an instrument for price. At least one valid instrument must be found for each endogenous explanatory variable.

Collect the predetermined explanatory variables together with the instruments in an $(r \times 1)$ vector $x_t$. For example, to estimate the demand curve, there were no predetermined explanatory variables in equation [9.1.1] and only a single instrument; hence, $r = 1$, and $x_t$ would be the scalar $w_t$. As a second example, suppose that the equation to be estimated is

$$y_t = \beta_1 + \beta_2 z_{2t} + \beta_3 z_{3t} + \beta_4 z_{4t} + \beta_5 z_{5t} + u_t.$$

In this example, $z_{4t}$ and $z_{5t}$ are endogenous (meaning that they are correlated with $u_t$), $z_{2t}$ and $z_{3t}$ are predetermined (uncorrelated with $u_t$), and $\xi_{1t}$, $\xi_{2t}$, and $\xi_{3t}$ are valid instruments (correlated with $z_{4t}$ and $z_{5t}$ but uncorrelated with $u_t$). Then $r = 6$ and $x_t' = (1, z_{2t}, z_{3t}, \xi_{1t}, \xi_{2t}, \xi_{3t})$. The requirement that there be at least as many instruments as endogenous explanatory variables implies that $r \ge k$.

Consider an OLS regression of z,, (the ith explanatory variable in [9.2.1]) on 


x; 
Zn = 8;X, + Cy. {9.2.2] 
The fitted values for the regression are given by 
2, = 8:x,, [9.2.3] 


238 Chapter 9 | Linear Systems of Simultaneous Equations 


where 


T -Ipgr 
6; = [= xxi| [= x, 


If z; is one of the predetermined variables, then z, is one of the elements of x, 
and equation [9.2.3] simplifies to 
Zu = Ze 


This is because when the dependent variable (z,,) is included in the regressors (x,), 
a unit coefficient on z, and zero coefficients on the other variables produce a 
perfect fit and thus minimize the residual sum of squares. 


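Collect the equations in [9.2.3] for $i = 1, 2, \ldots, k$ in a $(k \times 1)$ vector
equation

$$\hat z_t = \hat\delta' x_t, \qquad [9.2.4]$$

where the $(k \times r)$ matrix $\hat\delta'$ is given by

$$\hat\delta' = \begin{bmatrix} \hat\delta_1' \\ \hat\delta_2' \\ \vdots \\ \hat\delta_k' \end{bmatrix} = \left[\sum_{t=1}^T z_t x_t'\right]\left[\sum_{t=1}^T x_t x_t'\right]^{-1}. \qquad [9.2.5]$$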


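The two-stage least squares (2SLS) estimate of $\beta$ is found from an OLS regression
of $y_t$ on $\hat z_t$:

$$\hat\beta_{2SLS} = \left[\sum_{t=1}^T \hat z_t \hat z_t'\right]^{-1}\left[\sum_{t=1}^T \hat z_t y_t\right]. \qquad [9.2.6]$$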
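An alternative way of writing [9.2.6] is sometimes useful. Let $\hat e_{it}$ denote the
sample residual from OLS estimation of [9.2.2]; that is, let

$$z_{it} = \hat\delta_i' x_t + \hat e_{it} = \hat z_{it} + \hat e_{it}. \qquad [9.2.7]$$

OLS causes this residual to be orthogonal to $x_t$:

$$\sum_{t=1}^T x_t \hat e_{it} = 0,$$

meaning that the residual is orthogonal to $\hat z_{jt}$:

$$\sum_{t=1}^T \hat z_{jt} \hat e_{it} = \hat\delta_j' \sum_{t=1}^T x_t \hat e_{it} = 0.$$

Hence, if [9.2.7] is multiplied by $\hat z_{jt}$ and summed over $t$, the result is

$$\sum_{t=1}^T \hat z_{jt} z_{it} = \sum_{t=1}^T \hat z_{jt}(\hat z_{it} + \hat e_{it}) = \sum_{t=1}^T \hat z_{jt} \hat z_{it}$$

for all $i$ and $j$. This means that

$$\sum_{t=1}^T \hat z_t z_t' = \sum_{t=1}^T \hat z_t \hat z_t',$$

so that the 2SLS estimator [9.2.6] can equivalently be written as

$$\hat\beta_{2SLS} = \left[\sum_{t=1}^T \hat z_t z_t'\right]^{-1}\left[\sum_{t=1}^T \hat z_t y_t\right]. \qquad [9.2.8]$$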
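
As a numerical check on the equivalence of [9.2.6] and [9.2.8], the following Python sketch simulates a model with one endogenous regressor and two instruments and computes the estimator both ways. The data-generating process, the parameter values, and the function name `two_sls_scalar` are assumptions made purely for illustration; none of them come from the text.

```python
import random

def two_sls_scalar(y, z, x1, x2):
    """2SLS for y_t = beta*z_t + u_t with one endogenous z_t and instruments (x1_t, x2_t)."""
    # First stage: delta_hat = (sum x x')^{-1} (sum x z), written out for the 2x2 case.
    s11 = sum(a * a for a in x1)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s22 = sum(b * b for b in x2)
    r1 = sum(a * c for a, c in zip(x1, z))
    r2 = sum(b * c for b, c in zip(x2, z))
    det = s11 * s22 - s12 * s12
    d1 = (s22 * r1 - s12 * r2) / det
    d2 = (s11 * r2 - s12 * r1) / det
    z_hat = [d1 * a + d2 * b for a, b in zip(x1, x2)]      # fitted values, as in [9.2.4]
    zy = sum(f * v for f, v in zip(z_hat, y))
    beta_926 = zy / sum(f * f for f in z_hat)              # form [9.2.6]
    beta_928 = zy / sum(f * c for f, c in zip(z_hat, z))   # form [9.2.8]
    return beta_926, beta_928

# Hypothetical data-generating process (an assumption of this sketch).
rng = random.Random(0)
T, beta_true = 5000, 2.0
x1 = [rng.gauss(0, 1) for _ in range(T)]
x2 = [rng.gauss(0, 1) for _ in range(T)]
v = [rng.gauss(0, 1) for _ in range(T)]
u = [0.8 * e + rng.gauss(0, 0.6) for e in v]   # u_t is correlated with z_t through v_t
z = [a + 0.5 * b + e for a, b, e in zip(x1, x2, v)]
y = [beta_true * c + e for c, e in zip(z, u)]

beta_926, beta_928 = two_sls_scalar(y, z, x1, x2)
beta_ols = sum(c * d for c, d in zip(z, y)) / sum(c * c for c in z)
print(beta_926, beta_928, beta_ols)
```

The two 2SLS values should agree up to rounding error, while the OLS coefficient is pulled away from the true value by the correlation between $z_t$ and $u_t$.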


9.2. Instrumental Variables and Two-Stage Least Squares 239 


Consistency of 2SLS Estimator 
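Substituting [9.2.1] into [9.2.8],

$$\hat\beta_{2SLS,T} = \left[\sum_{t=1}^T \hat z_t z_t'\right]^{-1}\left[\sum_{t=1}^T \hat z_t (z_t'\beta + u_t)\right] = \beta + \left[\sum_{t=1}^T \hat z_t z_t'\right]^{-1}\left[\sum_{t=1}^T \hat z_t u_t\right], \qquad [9.2.9]$$

where the subscript $T$ has been added to keep explicit track of the sample size $T$
on which estimation is based. It follows from [9.2.9] that

$$\hat\beta_{2SLS,T} - \beta = \left[(1/T)\sum_{t=1}^T \hat z_t z_t'\right]^{-1}\left[(1/T)\sum_{t=1}^T \hat z_t u_t\right]. \qquad [9.2.10]$$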


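Consistency of the 2SLS estimator can then be shown as follows. First note
from [9.2.4] and [9.2.5] that

$$(1/T)\sum_{t=1}^T \hat z_t z_t' = \hat\delta_T' (1/T)\sum_{t=1}^T x_t z_t' = \left[(1/T)\sum_{t=1}^T z_t x_t'\right]\left[(1/T)\sum_{t=1}^T x_t x_t'\right]^{-1}\left[(1/T)\sum_{t=1}^T x_t z_t'\right]. \qquad [9.2.11]$$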


Assuming that the process $(z_t, x_t)$ is covariance-stationary and ergodic for second
moments,

$$(1/T)\sum_{t=1}^T \hat z_t z_t' \xrightarrow{p} Q, \qquad [9.2.12]$$

where

$$Q = [E(z_t x_t')][E(x_t x_t')]^{-1}[E(x_t z_t')]. \qquad [9.2.13]$$


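Turning next to the second term in [9.2.10],

$$\left[(1/T)\sum_{t=1}^T \hat z_t u_t\right] = \hat\delta_T' (1/T)\sum_{t=1}^T x_t u_t.$$

Again, ergodicity for second moments implies from [9.2.5] that

$$\hat\delta_T' \xrightarrow{p} [E(z_t x_t')][E(x_t x_t')]^{-1}, \qquad [9.2.14]$$

while the law of large numbers will typically ensure that

$$(1/T)\sum_{t=1}^T x_t u_t \xrightarrow{p} E(x_t u_t) = 0,$$

under the assumed absence of correlation between $x_t$ and $u_t$. Hence,

$$\left[(1/T)\sum_{t=1}^T \hat z_t u_t\right] \xrightarrow{p} 0. \qquad [9.2.15]$$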
tal 


Substituting [9.2.12] and [9.2.15] into [9.2.10], it follows that

$$\hat\beta_{2SLS,T} - \beta \xrightarrow{p} Q^{-1} \cdot 0 = 0.$$




Hence, the 2SLS estimator is consistent as long as the matrix $Q$ in [9.2.13] is
nonsingular.

Notice that if none of the predetermined variables is correlated with $z_{it}$, then
the $i$th row of $E(z_t x_t')$ contains all zeros and the corresponding row of $Q$ in [9.2.13]
contains all zeros, in which case 2SLS is not consistent. Alternatively, if $z_{it}$ is
correlated with $x_t$ only through, say, the first element $x_{1t}$ and $z_{jt}$ is also correlated
with $x_t$ only through $x_{1t}$, then subtracting some multiple of the $i$th row of $Q$ from
the $j$th row produces a row of zeros, and $Q$ again is not invertible. In general,
consistency of the 2SLS estimator requires the rows of $E(z_t x_t')$ to be linearly
independent. This essentially amounts to the requirement that there be a way of
assigning instruments to endogenous variables such that each endogenous variable
has an instrument associated with it, with no instrument counted twice for this
purpose.


Asymptotic Distribution of 2SLS Estimator 
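Equation [9.2.10] implies that

$$\sqrt{T}(\hat\beta_{2SLS,T} - \beta) = \left[(1/T)\sum_{t=1}^T \hat z_t z_t'\right]^{-1}\left[(1/\sqrt{T})\sum_{t=1}^T \hat z_t u_t\right], \qquad [9.2.16]$$

where

$$\left[(1/\sqrt{T})\sum_{t=1}^T \hat z_t u_t\right] = \hat\delta_T' (1/\sqrt{T})\sum_{t=1}^T x_t u_t.$$

Hence, from [9.2.12] and [9.2.14],

$$\sqrt{T}(\hat\beta_{2SLS,T} - \beta) \xrightarrow{p} Q^{-1} \cdot [E(z_t x_t')][E(x_t x_t')]^{-1}\left((1/\sqrt{T})\sum_{t=1}^T x_t u_t\right). \qquad [9.2.17]$$

Suppose that $x_t$ is covariance-stationary and that $\{u_t\}$ is an i.i.d. sequence with
mean zero and variance $\sigma^2$ with $u_t$ independent of $x_s$ for all $s \le t$. Then $\{x_t u_t\}$ is a
martingale difference sequence with variance-covariance matrix given by
$\sigma^2 \cdot E(x_t x_t')$. If $u_t$ and $x_t$ have finite fourth moments, then we can expect from
Proposition 7.9 that

$$(1/\sqrt{T})\sum_{t=1}^T x_t u_t \xrightarrow{L} N(0, \sigma^2 \cdot E(x_t x_t')). \qquad [9.2.18]$$

Thus, [9.2.17] implies that

$$\sqrt{T}(\hat\beta_{2SLS,T} - \beta) \xrightarrow{L} N(0, V), \qquad [9.2.19]$$

where

$$V = Q^{-1}[E(z_t x_t')][E(x_t x_t')]^{-1}[\sigma^2 \cdot E(x_t x_t')][E(x_t x_t')]^{-1}[E(x_t z_t')]Q^{-1} = \sigma^2 Q^{-1} \cdot Q \cdot Q^{-1} = \sigma^2 Q^{-1} \qquad [9.2.20]$$

for $Q$ given in [9.2.13]. Hence,

$$\hat\beta_{2SLS,T} \approx N(\beta, (1/T)\sigma^2 Q^{-1}). \qquad [9.2.21]$$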




Since $\hat\beta_{2SLS}$ is a consistent estimate of $\beta$, clearly a consistent estimate of the
population residual for observation $t$ is afforded by

$$\hat u_{t,T} = y_t - z_t'\hat\beta_{2SLS,T} \xrightarrow{p} u_t. \qquad [9.2.22]$$

Similarly, it is straightforward to show that $\sigma^2$ can be consistently estimated by

$$\hat\sigma_T^2 = (1/T)\sum_{t=1}^T (y_t - z_t'\hat\beta_{2SLS,T})^2 \qquad [9.2.23]$$

(see Exercise 9.1). Note well that although $\hat\beta_{2SLS}$ can be calculated from an OLS
regression of $y_t$ on $\hat z_t$, the estimates $\hat u_t$ and $\hat\sigma^2$ in [9.2.22] and [9.2.23] are not based
on the residuals from this regression:

$$\hat u_t \neq y_t - \hat z_t'\hat\beta_{2SLS}$$

$$\hat\sigma^2 \neq (1/T)\sum_{t=1}^T (y_t - \hat z_t'\hat\beta_{2SLS})^2.$$

The correct estimates [9.2.22] and [9.2.23] use the actual explanatory variables $z_t$
rather than the fitted values $\hat z_t$.
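
The distinction between residuals built from $z_t$ and residuals built from $\hat z_t$ can be large in practice. The Python sketch below compares the two variance estimates on simulated data with a single instrument; the data-generating process and all numerical values are assumptions chosen for this illustration, with the true disturbance variance set to 1.

```python
import random

# Hypothetical data-generating process (an assumption of this sketch).
rng = random.Random(1)
T, beta_true = 5000, 2.0
x = [rng.gauss(0, 1) for _ in range(T)]
v = [rng.gauss(0, 1) for _ in range(T)]
u = [0.8 * e + rng.gauss(0, 0.6) for e in v]   # Var(u_t) = 0.64 + 0.36 = 1
z = [a + e for a, e in zip(x, v)]              # endogenous regressor
y = [beta_true * c + e for c, e in zip(z, u)]

# First stage and 2SLS estimate in the form [9.2.8].
delta_hat = sum(a * c for a, c in zip(x, z)) / sum(a * a for a in x)
z_hat = [delta_hat * a for a in x]
beta_hat = (sum(f * d for f, d in zip(z_hat, y))
            / sum(f * c for f, c in zip(z_hat, z)))

# Variance estimate [9.2.23]: residuals use the actual regressor z_t.
sigma2_correct = sum((d - c * beta_hat) ** 2 for d, c in zip(y, z)) / T
# Second-stage OLS residuals use the fitted values and do not estimate sigma^2.
sigma2_wrong = sum((d - f * beta_hat) ** 2 for d, f in zip(y, z_hat)) / T
print(sigma2_correct, sigma2_wrong)
```

In this design the second-stage residual equals $u_t$ plus $\beta$ times the first-stage residual, so the naive estimate is several times larger than $\sigma^2$.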
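A consistent estimate of $Q$ is provided by [9.2.11]:

$$\hat Q_T = (1/T)\sum_{t=1}^T \hat z_t z_t' = \left[(1/T)\sum_{t=1}^T z_t x_t'\right]\left[(1/T)\sum_{t=1}^T x_t x_t'\right]^{-1}\left[(1/T)\sum_{t=1}^T x_t z_t'\right]. \qquad [9.2.24]$$

Substituting [9.2.23] and [9.2.24] into [9.2.21], the estimated variance-covariance
matrix of the 2SLS estimator is

$$\hat V_T/T = \hat\sigma_T^2 (1/T)\left[(1/T)\sum_{t=1}^T \hat z_t z_t'\right]^{-1} = \hat\sigma_T^2 \left\{\left[\sum_{t=1}^T z_t x_t'\right]\left[\sum_{t=1}^T x_t x_t'\right]^{-1}\left[\sum_{t=1}^T x_t z_t'\right]\right\}^{-1}. \qquad [9.2.25]$$

A test of the null hypothesis $R\beta = r$ can thus be based on

$$(R\hat\beta_{2SLS,T} - r)'[R(\hat V_T/T)R']^{-1}(R\hat\beta_{2SLS,T} - r), \qquad [9.2.26]$$

which, under the null hypothesis, has an asymptotic distribution that is $\chi^2$ with
degrees of freedom given by $m$, where $m$ represents the number of restrictions or
the number of rows of $R$.

Heteroskedasticity- and autocorrelation-consistent standard errors for 2SLS
estimation will be discussed in Chapter 14.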


Instrumental Variable Estimation 
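Substituting [9.2.4] and [9.2.5] into [9.2.8], the 2SLS estimator can be written as

$$\hat\beta_{2SLS} = \left[\sum_{t=1}^T \hat\delta' x_t z_t'\right]^{-1}\left[\sum_{t=1}^T \hat\delta' x_t y_t\right]$$

$$= \left\{\left[\sum_{t=1}^T z_t x_t'\right]\left[\sum_{t=1}^T x_t x_t'\right]^{-1}\left[\sum_{t=1}^T x_t z_t'\right]\right\}^{-1}\left\{\left[\sum_{t=1}^T z_t x_t'\right]\left[\sum_{t=1}^T x_t x_t'\right]^{-1}\left[\sum_{t=1}^T x_t y_t\right]\right\}. \qquad [9.2.27]$$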


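Consider the special case in which the number of instruments is exactly equal to
the number of endogenous explanatory variables, so that $r = k$, as was the case
for estimation of the demand curve in Section 9.1. Then $\sum_{t=1}^T z_t x_t'$ is a $(k \times k)$
matrix and [9.2.27] becomes

$$\hat\beta_{IV} = \left\{\left[\sum_{t=1}^T x_t z_t'\right]^{-1}\left[\sum_{t=1}^T x_t x_t'\right]\left[\sum_{t=1}^T z_t x_t'\right]^{-1}\right\}\left\{\left[\sum_{t=1}^T z_t x_t'\right]\left[\sum_{t=1}^T x_t x_t'\right]^{-1}\left[\sum_{t=1}^T x_t y_t\right]\right\} = \left[\sum_{t=1}^T x_t z_t'\right]^{-1}\left[\sum_{t=1}^T x_t y_t\right]. \qquad [9.2.28]$$

Expression [9.2.28] is known as the instrumental variable (IV) estimator.
A key property of the IV estimator can be seen by premultiplying both sides
of [9.2.28] by $\sum_{t=1}^T x_t z_t'$:

$$\sum_{t=1}^T x_t z_t' \hat\beta_{IV} = \sum_{t=1}^T x_t y_t,$$

implying that

$$\sum_{t=1}^T x_t (y_t - z_t'\hat\beta_{IV}) = 0. \qquad [9.2.29]$$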


Thus, the IV sample residual $(y_t - z_t'\hat\beta_{IV})$ has the property that it is orthogonal
to the instruments $x_t$, in contrast to the OLS sample residual $(y_t - z_t'b)$, which is
orthogonal to the explanatory variables $z_t$. The IV estimator is preferred to OLS
because the population residual of the equation we are trying to estimate ($u_t$) is
correlated with $z_t$ but uncorrelated with $x_t$.

Since the IV estimator is a special case of 2SLS, it shares the consistency
property of the 2SLS estimator. Its estimated variance with i.i.d. residuals can be
calculated from [9.2.25]:

$$\hat\sigma_T^2 \left[\sum_{t=1}^T x_t z_t'\right]^{-1}\left[\sum_{t=1}^T x_t x_t'\right]\left[\sum_{t=1}^T z_t x_t'\right]^{-1}. \qquad [9.2.30]$$
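
To illustrate the orthogonality property [9.2.29] and the contrast with OLS, the Python sketch below simulates a supply-and-demand system with a weather-type instrument $w_t$ and estimates the demand elasticity both ways. The function name `simulate_market`, the parameter values, and the unit variances are assumptions of this sketch and are not taken from the text.

```python
import random

def simulate_market(T, beta, gamma, h, seed):
    """Draw (q_t, p_t, w_t) from demand q = beta*p + u_d and supply q = gamma*p + h*w + u_s."""
    rng = random.Random(seed)
    q, p, w = [], [], []
    for _ in range(T):
        w_t, u_d, u_s = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        p_t = (h * w_t + u_s - u_d) / (beta - gamma)   # equilibrium price
        q.append(beta * p_t + u_d)                      # quantity from the demand curve
        p.append(p_t)
        w.append(w_t)
    return q, p, w

beta_true = -1.0   # hypothetical demand elasticity
q, p, w = simulate_market(T=5000, beta=beta_true, gamma=1.0, h=1.0, seed=2)

# IV estimator [9.2.28] with scalar instrument w_t and scalar regressor p_t.
beta_iv = sum(a * b for a, b in zip(w, q)) / sum(a * b for a, b in zip(w, p))
# OLS regression of quantity on price, which is inconsistent here.
beta_ols = sum(a * b for a, b in zip(p, q)) / sum(a * a for a in p)
# Left side of [9.2.29]: the IV residual is orthogonal to the instrument.
orthogonality = sum(a * (b - beta_iv * c) for a, b, c in zip(w, q, p))
print(beta_iv, beta_ols, orthogonality)
```

The orthogonality sum is zero up to rounding error by construction, whereas the OLS coefficient blends the supply and demand slopes.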


9.3. Identification 


We noted in the supply-and-demand example in Section 9.1 that the demand
elasticity $\beta$ could not be estimated consistently by an OLS regression of quantity
on price. Indeed, in the absence of a valid instrument such as $w_t$, the demand
elasticity cannot be estimated by any method! To see this, recall that the system
as written in [9.1.1] and [9.1.2] implied the expressions [9.1.4] and [9.1.3]:

$$q_t = \frac{\gamma}{\gamma - \beta}\,\varepsilon_t^d - \frac{\beta}{\gamma - \beta}\,\varepsilon_t^s$$

$$p_t = \frac{\varepsilon_t^d - \varepsilon_t^s}{\gamma - \beta}.$$


If $\varepsilon_t^d$ and $\varepsilon_t^s$ are i.i.d. Gaussian, then these equations imply that the vector $(q_t, p_t)'$
is Gaussian with mean zero and variance-covariance matrix

$$\Omega = [1/(\gamma - \beta)^2] \begin{bmatrix} \gamma^2\sigma_d^2 + \beta^2\sigma_s^2 & \gamma\sigma_d^2 + \beta\sigma_s^2 \\ \gamma\sigma_d^2 + \beta\sigma_s^2 & \sigma_d^2 + \sigma_s^2 \end{bmatrix}.$$

This matrix is completely described by three magnitudes, these being the variances
of $q$ and $p$ along with their covariance. Given a large enough sample, the values
of these three magnitudes can be inferred with considerable confidence, but that
is all that can be inferred, because these magnitudes can completely specify the
process that generated the data under the maintained assumption of zero-mean
i.i.d. Gaussian observations. There is no way to uncover the four parameters of
the structural model $(\beta, \gamma, \sigma_d^2, \sigma_s^2)$ from these three magnitudes. For example, the
values $(\beta, \gamma, \sigma_d^2, \sigma_s^2) = (1, 2, 3, 4)$ imply exactly the same observable properties
for the data as would $(\beta, \gamma, \sigma_d^2, \sigma_s^2) = (2, 1, 4, 3)$.
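
The observational equivalence of these two parameter vectors can be confirmed by direct arithmetic. The short Python sketch below evaluates the three distinct elements of the variance-covariance matrix of $(q_t, p_t)'$ for each case; the function name `omega` is a label introduced here for illustration.

```python
def omega(beta, gamma, var_d, var_s):
    """Return (Var(q), Cov(q, p), Var(p)) implied by the structural parameters."""
    scale = 1.0 / (gamma - beta) ** 2
    var_q = scale * (gamma ** 2 * var_d + beta ** 2 * var_s)
    cov_qp = scale * (gamma * var_d + beta * var_s)
    var_p = scale * (var_d + var_s)
    return (var_q, cov_qp, var_p)

first = omega(1, 2, 3, 4)    # (beta, gamma, sigma_d^2, sigma_s^2) = (1, 2, 3, 4)
second = omega(2, 1, 4, 3)   # (beta, gamma, sigma_d^2, sigma_s^2) = (2, 1, 4, 3)
print(first, second)         # both are (16.0, 10.0, 7.0)
```

Both parameter vectors give a variance of 16 for $q$, a covariance of 10, and a variance of 7 for $p$, so no amount of data on $(q_t, p_t)$ alone can distinguish them.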

If two different values for a parameter vector $\theta$ imply the same probability
distribution for the observed data, then the vector $\theta$ is said to be unidentified.

When a third Gaussian white noise variable w, is added to the set of obser- 
vations, three additional magnitudes are available to characterize the process for 
observables, these being the variance of w, the covariance between w and p, and 
the covariance between w and q. If the new variable w enters both the demand 
and the supply equation, then three new parameters would be required to estimate 
the structural model—the parameter that summarizes the effect of w on demand, 
the parameter that summarizes its effect on supply, and the variance of w. With 
three more estimable magnitudes but three more parameters to estimate, we would 
be stuck with the same problem, having no basis for estimation of $\beta$.

Consistent estimation of the demand elasticity was achieved by using two- 
stage least squares because it was assumed that w appeared in the supply equation 
but was excluded from the demand equation. This is known as achieving identi- 
fication through exclusion restrictions. 

We showed in Section 9.2 that the parameters of an equation could be
estimated (and thus must be identified) if (1) the number of instruments for that
equation is at least as great as the number of endogenous explanatory variables
for that equation and (2) the rows of $E(z_t x_t')$ are linearly independent. The first
condition is known as the order condition for identification, and the second is
known as the rank condition.

The rank condition for identification can be summarized more explicitly by
specifying a complete system of equations for all of the endogenous variables. Let
$y_t$ denote an $(n \times 1)$ vector containing all of the endogenous variables in the
system, and let $x_t$ denote an $(m \times 1)$ vector containing all of the predetermined
variables. Suppose that the system consists of $n$ equations written as

$$B y_t + \Gamma x_t = u_t, \qquad [9.3.1]$$

where $B$ and $\Gamma$ are $(n \times n)$ and $(n \times m)$ matrices of coefficients, respectively, and
$u_t$ is an $(n \times 1)$ vector of disturbances. The statement that $x_t$ is predetermined is
taken to mean that $E(x_t u_t') = 0$. For example, the demand and supply equations
considered in Section 9.1 were

$$q_t = \beta p_t + u_t^d \quad \text{(demand)} \qquad [9.3.2]$$

$$q_t = \gamma p_t + h w_t + u_t^s \quad \text{(supply)}. \qquad [9.3.3]$$


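For this system, there are $n = 2$ endogenous variables, with $y_t = (q_t, p_t)'$; and
$m = 1$ predetermined variable, so that $x_t = w_t$. This system can be written in the
form of [9.3.1] as

$$\begin{bmatrix} 1 & -\beta \\ 1 & -\gamma \end{bmatrix}\begin{bmatrix} q_t \\ p_t \end{bmatrix} + \begin{bmatrix} 0 \\ -h \end{bmatrix} w_t = \begin{bmatrix} u_t^d \\ u_t^s \end{bmatrix}. \qquad [9.3.4]$$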


Suppose we are interested in the equation represented by the first row of the
vector system of equations in [9.3.1]. Let $y_{0t}$ be the dependent variable in the first
equation, and let $y_{1t}$ denote an $(n_1 \times 1)$ vector consisting of those endogenous
variables that appear in the first equation as explanatory variables. Similarly, let
$x_{1t}$ denote an $(m_1 \times 1)$ vector consisting of those predetermined variables that
appear in the first equation as explanatory variables. Then the first equation in the
system is

$$y_{0t} + B_{01} y_{1t} + \Gamma_{01} x_{1t} = u_{0t},$$

where $B_{01}$ is a $(1 \times n_1)$ vector and $\Gamma_{01}$ is a $(1 \times m_1)$ vector. Let $y_{2t}$ denote an
$(n_2 \times 1)$ vector consisting of those endogenous variables that do not appear in the
first equation; thus, $y_t' = (y_{0t}, y_{1t}', y_{2t}')$ and $1 + n_1 + n_2 = n$. Similarly, let $x_{2t}$
denote an $(m_2 \times 1)$ vector consisting of those predetermined variables that do not
appear in the first equation, so that $x_t' = (x_{1t}', x_{2t}')$ and $m_1 + m_2 = m$. Then the
system in [9.3.1] can be written in partitioned form as

$$\begin{bmatrix} 1 & B_{01} & 0' \\ B_{10} & B_{11} & B_{12} \\ B_{20} & B_{21} & B_{22} \end{bmatrix}\begin{bmatrix} y_{0t} \\ y_{1t} \\ y_{2t} \end{bmatrix} + \begin{bmatrix} \Gamma_{01} & 0' \\ \Gamma_{11} & \Gamma_{12} \\ \Gamma_{21} & \Gamma_{22} \end{bmatrix}\begin{bmatrix} x_{1t} \\ x_{2t} \end{bmatrix} = \begin{bmatrix} u_{0t} \\ u_{1t} \\ u_{2t} \end{bmatrix}. \qquad [9.3.5]$$

Here, for example, $B_{12}$ is an $(n_1 \times n_2)$ matrix consisting of rows 2 through $(n_1 + 1)$
and columns $(n_1 + 2)$ through $n$ of the matrix $B$.

An alternative useful representation of the system is obtained by moving $\Gamma x_t$
to the right side of [9.3.1] and premultiplying both sides by $B^{-1}$:

$$y_t = -B^{-1}\Gamma x_t + B^{-1} u_t = \Pi' x_t + v_t, \qquad [9.3.6]$$

where

$$\Pi' = -B^{-1}\Gamma \qquad [9.3.7]$$

$$v_t = B^{-1} u_t. \qquad [9.3.8]$$


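Expression [9.3.6] is known as the reduced-form representation of the structural
system [9.3.1]. In the reduced-form representation, each endogenous variable is
expressed solely as a function of predetermined variables. For the example of
[9.3.4], the reduced form is

$$\begin{bmatrix} q_t \\ p_t \end{bmatrix} = \begin{bmatrix} 1 & -\beta \\ 1 & -\gamma \end{bmatrix}^{-1}\begin{bmatrix} 0 \\ h \end{bmatrix} w_t + \begin{bmatrix} 1 & -\beta \\ 1 & -\gamma \end{bmatrix}^{-1}\begin{bmatrix} u_t^d \\ u_t^s \end{bmatrix}$$

$$= [1/(\beta - \gamma)]\begin{bmatrix} -\gamma & \beta \\ -1 & 1 \end{bmatrix}\begin{bmatrix} 0 \\ h \end{bmatrix} w_t + [1/(\beta - \gamma)]\begin{bmatrix} -\gamma & \beta \\ -1 & 1 \end{bmatrix}\begin{bmatrix} u_t^d \\ u_t^s \end{bmatrix}$$

$$= [1/(\beta - \gamma)]\begin{bmatrix} \beta h \\ h \end{bmatrix} w_t + [1/(\beta - \gamma)]\begin{bmatrix} -\gamma u_t^d + \beta u_t^s \\ -u_t^d + u_t^s \end{bmatrix}. \qquad [9.3.9]$$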


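The reduced form for a general system can be written in partitioned form as

$$\begin{bmatrix} y_{0t} \\ y_{1t} \\ y_{2t} \end{bmatrix} = \begin{bmatrix} \Pi_{01} & \Pi_{02} \\ \Pi_{11} & \Pi_{12} \\ \Pi_{21} & \Pi_{22} \end{bmatrix}\begin{bmatrix} x_{1t} \\ x_{2t} \end{bmatrix} + \begin{bmatrix} v_{0t} \\ v_{1t} \\ v_{2t} \end{bmatrix}, \qquad [9.3.10]$$

where, for example, $\Pi_{12}$ denotes an $(n_1 \times m_2)$ matrix consisting of rows 2 through
$(n_1 + 1)$ and columns $(m_1 + 1)$ through $m$ of the matrix $\Pi'$.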

To apply the rank condition for identification of the first equation stated
earlier, we would form the matrix of cross products between the explanatory
variables in the first equation ($x_{1t}$ and $y_{1t}$) and the predetermined variables for the
whole system ($x_{1t}$ and $x_{2t}$):

$$M = \begin{bmatrix} E(x_{1t} x_{1t}') & E(x_{1t} x_{2t}') \\ E(y_{1t} x_{1t}') & E(y_{1t} x_{2t}') \end{bmatrix}. \qquad [9.3.11]$$

In the earlier notation, the explanatory variables for the first equation consist of
$z_t = (x_{1t}', y_{1t}')'$, while the predetermined variables for the system as a whole consist
of $x_t = (x_{1t}', x_{2t}')'$. Thus, the rank condition, which required the rows of $E(z_t x_t')$
to be linearly independent, amounts to the requirement that the rows of the
$[(m_1 + n_1) \times m]$ matrix $M$ in [9.3.11] be linearly independent. The rank condition
can equivalently be stated in terms of the structural parameter matrices $B$ and $\Gamma$
or the reduced-form parameter matrix $\Pi$. The following proposition is adapted
from Fisher (1966) and is proved in Appendix 9.A at the end of this chapter.


Proposition 9.1: If the matrix $B$ in [9.3.1] and the matrix of second moments of the
predetermined variables $E(x_t x_t')$ are both nonsingular, then the following conditions
are equivalent:

(a) The rows of the $[(m_1 + n_1) \times m]$ matrix $M$ in [9.3.11] are linearly independent.
(b) The rows of the $[(n_1 + n_2) \times (m_2 + n_2)]$ matrix

$$\begin{bmatrix} \Gamma_{12} & B_{12} \\ \Gamma_{22} & B_{22} \end{bmatrix} \qquad [9.3.12]$$

are linearly independent.
(c) The rows of the $(n_1 \times m_2)$ matrix $\Pi_{12}$ are linearly independent.


For example, for the system in [9.3.4], no endogenous variables are excluded
from the first equation, and so $y_{0t} = q_t$, $y_{1t} = p_t$, and $y_{2t}$ contains no elements.
No predetermined variables appear in the first equation, and so $x_{1t}$ contains no
elements and $x_{2t} = w_t$. The matrix in [9.3.12] is then just given by the parameter
$\Gamma_{12}$. This represents the coefficient on $x_{2t}$ in the equation describing $y_{1t}$ and is equal
to the scalar parameter $-h$. Result (b) of Proposition 9.1 thus states that the first
equation is identified provided that $h \neq 0$. The value of $\Pi_{12}$ can be read directly
off the coefficient on $w_t$ in the second row of [9.3.9] and turns out to be given by
$h/(\beta - \gamma)$. Since $B$ is assumed to be nonsingular, $(\beta - \gamma)$ is nonzero, and so $\Gamma_{12}$
is zero if and only if $\Pi_{12}$ is zero.
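
The reduced-form coefficients for this two-equation example can be computed mechanically from $\Pi' = -B^{-1}\Gamma$. The Python sketch below does so for hypothetical parameter values; the function name `reduced_form` and the numbers passed to it are assumptions made only for illustration.

```python
def reduced_form(beta, gamma, h):
    """Return [pi_1, pi_2], the reduced-form coefficients on w_t for (q_t, p_t)."""
    # Structural matrices: B = [[1, -beta], [1, -gamma]] and Gamma = [0, -h]'.
    det = beta - gamma                       # determinant of B, nonzero by assumption
    b_inv = [[-gamma / det, beta / det],
             [-1.0 / det, 1.0 / det]]
    gam = [0.0, -h]
    # Pi' = -B^{-1} Gamma, computed row by row.
    return [-(b_inv[i][0] * gam[0] + b_inv[i][1] * gam[1]) for i in range(2)]

beta, gamma, h = -1.0, 1.0, 2.0              # hypothetical values
pi_1, pi_2 = reduced_form(beta, gamma, h)
print(pi_1, pi_2)                            # beta*h/(beta - gamma) and h/(beta - gamma)

# With h = 0 the coefficient Pi_12 vanishes and the rank condition fails.
pi_2_no_instrument = reduced_form(beta, gamma, 0.0)[1]
```

The second element reproduces $\Pi_{12} = h/(\beta - \gamma)$, which is nonzero exactly when $h$ is nonzero.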


Achieving Identification Through Covariance Restrictions 


Another way in which parameters can be identified is through restrictions on
the covariances of the errors of the structural equations. For example, consider
again the supply and demand model, [9.3.2] and [9.3.3]. We saw that the demand
elasticity $\beta$ was identified by the exclusion of $w_t$ from the demand equation.
Consider now estimation of the supply elasticity $\gamma$.

Suppose first that we somehow knew the value of the demand elasticity $\beta$
with certainty. Then the error in the demand equation could be constructed from

$$u_t^d = q_t - \beta p_t.$$

Notice that $u_t^d$ would then be a valid instrument for the supply equation [9.3.3],
since $u_t^d$ is correlated with the endogenous explanatory variable for that equation
($p_t$) but $u_t^d$ is uncorrelated with the error for that equation ($u_t^s$). Since $w_t$ is also
uncorrelated with the error $u_t^s$, it follows that the parameters of the supply equation
could be estimated consistently by instrumental variable estimation with $x_t =
(u_t^d, w_t)'$:

$$\begin{bmatrix} \hat\gamma_T^* \\ \hat h_T^* \end{bmatrix} = \begin{bmatrix} \Sigma u_t^d p_t & \Sigma u_t^d w_t \\ \Sigma w_t p_t & \Sigma w_t^2 \end{bmatrix}^{-1}\begin{bmatrix} \Sigma u_t^d q_t \\ \Sigma w_t q_t \end{bmatrix} \xrightarrow{p} \begin{bmatrix} \gamma \\ h \end{bmatrix}, \qquad [9.3.13]$$

where $\Sigma$ indicates summation over $t = 1, 2, \ldots, T$.


Although in practice we do not know the true value of $\beta$, it can be estimated
consistently by IV estimation of [9.3.2] with $w_t$ as an instrument:

$$\hat\beta = (\Sigma w_t p_t)^{-1}(\Sigma w_t q_t).$$

Then the residual $u_t^d$ can be estimated consistently with $\hat u_t^d = q_t - \hat\beta p_t$. Consider,
therefore, the estimator [9.3.13] with the population residual $u_t^d$ replaced by the
IV sample residual:

$$\begin{bmatrix} \hat\gamma_T \\ \hat h_T \end{bmatrix} = \begin{bmatrix} \Sigma \hat u_t^d p_t & \Sigma \hat u_t^d w_t \\ \Sigma w_t p_t & \Sigma w_t^2 \end{bmatrix}^{-1}\begin{bmatrix} \Sigma \hat u_t^d q_t \\ \Sigma w_t q_t \end{bmatrix}. \qquad [9.3.14]$$

It is straightforward to use the fact that $\hat\beta \xrightarrow{p} \beta$ to deduce that the difference between
the estimators in [9.3.14] and [9.3.13] converges in probability to zero. Hence, the
estimator [9.3.14] is also consistent.
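
The two-step procedure behind [9.3.14] can be traced numerically. In the Python sketch below the demand elasticity is first estimated by IV with $w_t$ as instrument, and the resulting demand residual is then used together with $w_t$ as an instrument set for the supply equation. The simulated design, including the helper names `simulate_market` and `sdot` and every parameter value, is an assumption of this sketch; the two structural errors are drawn independently, so the covariance restriction holds by construction.

```python
import random

def simulate_market(T, beta, gamma, h, seed):
    """Draw (q_t, p_t, w_t) with independent demand and supply errors."""
    rng = random.Random(seed)
    q, p, w = [], [], []
    for _ in range(T):
        w_t, u_d, u_s = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        p_t = (h * w_t + u_s - u_d) / (beta - gamma)
        q.append(beta * p_t + u_d)
        p.append(p_t)
        w.append(w_t)
    return q, p, w

def sdot(a, b):
    """Sum over t of a_t * b_t."""
    return sum(x * y for x, y in zip(a, b))

q, p, w = simulate_market(T=5000, beta=-1.0, gamma=1.0, h=1.0, seed=3)

# Step 1: IV estimate of the demand elasticity and the implied demand residual.
beta_hat = sdot(w, q) / sdot(w, p)
u_d_hat = [a - beta_hat * b for a, b in zip(q, p)]

# Step 2: solve the 2x2 system in [9.3.14] by Cramer's rule.
a11, a12 = sdot(u_d_hat, p), sdot(u_d_hat, w)
a21, a22 = sdot(w, p), sdot(w, w)
b1, b2 = sdot(u_d_hat, q), sdot(w, q)
det = a11 * a22 - a12 * a21
gamma_hat = (b1 * a22 - a12 * b2) / det
h_hat = (a11 * b2 - a21 * b1) / det
print(beta_hat, gamma_hat, h_hat)
```

If the simulated errors were instead drawn with nonzero correlation, the second step would no longer recover the supply parameters, which is the role of the covariance restriction.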

Two assumptions allowed the parameters of the supply equation ($\gamma$ and $h$)
to be estimated. First, an exclusion restriction allowed $\beta$ to be estimated
consistently. Second, a restriction on the covariance between $u_t^d$ and $u_t^s$ was necessary.
If $u_t^d$ were correlated with $u_t^s$, then $u_t^d$ would not be a valid instrument for the
supply equation and the estimator [9.3.13] would not be consistent.


Other Approaches to Identification 


A good deal more can be said about identification. For example, parameters
can also be identified through the imposition of certain restrictions on parameters
such as $\beta_1 + \beta_2 = 1$. Useful references include Fisher (1966), Rothenberg (1971),
and Hausman and Taylor (1983).


9.4. Full-Information Maximum Likelihood Estimation 


Up to this point we have considered estimation of a single equation of the form
$y_t = \beta' z_t + u_t$. A more general approach is to specify a similar equation for every
endogenous variable in the system, calculate the joint density of the vector of all
of the endogenous variables conditional on the predetermined variables, and
maximize the joint likelihood function. This is known as full-information maximum
likelihood estimation, or FIML.




For illustration, suppose in [9.3.1] that the $(n \times 1)$ vector of structural
disturbances $u_t$ for date $t$ is distributed $N(0, D)$. Assume, further, that $u_t$ is independent
of $u_\tau$ for $t \neq \tau$ and that $u_t$ is independent of $x_\tau$ for all $t$ and $\tau$. Then the
reduced-form disturbance $v_t = B^{-1} u_t$ is distributed $N(0, B^{-1}D(B^{-1})')$, and the
reduced-form representation [9.3.6] implies that

$$y_t | x_t \sim N\left(\Pi' x_t, B^{-1}D(B^{-1})'\right) = N\left(-B^{-1}\Gamma x_t, B^{-1}D(B^{-1})'\right).$$


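The conditional log likelihood can then be found from

$$\mathcal{L}(B, \Gamma, D) = \sum_{t=1}^T \log f(y_t | x_t; B, \Gamma, D)$$

$$= -(Tn/2)\log(2\pi) - (T/2)\log\left|B^{-1}D(B^{-1})'\right| - (1/2)\sum_{t=1}^T [y_t + B^{-1}\Gamma x_t]'[B^{-1}D(B^{-1})']^{-1}[y_t + B^{-1}\Gamma x_t]. \qquad [9.4.1]$$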


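But

$$[y_t + B^{-1}\Gamma x_t]'[B^{-1}D(B^{-1})']^{-1}[y_t + B^{-1}\Gamma x_t]$$
$$= [y_t + B^{-1}\Gamma x_t]'[B'D^{-1}B][y_t + B^{-1}\Gamma x_t]$$
$$= [B(y_t + B^{-1}\Gamma x_t)]'D^{-1}[B(y_t + B^{-1}\Gamma x_t)] \qquad [9.4.2]$$
$$= [By_t + \Gamma x_t]'D^{-1}[By_t + \Gamma x_t].$$

Furthermore,

$$\left|B^{-1}D(B^{-1})'\right| = |B^{-1}| \cdot |D| \cdot |B^{-1}| = |D|/|B|^2. \qquad [9.4.3]$$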


Substituting [9.4.2] and [9.4.3] into [9.4.1],

$$\mathcal{L}(B, \Gamma, D) = -(Tn/2)\log(2\pi) + (T/2)\log|B|^2 - (T/2)\log|D| - (1/2)\sum_{t=1}^T [By_t + \Gamma x_t]'D^{-1}[By_t + \Gamma x_t]. \qquad [9.4.4]$$

The FIML estimates are then the values of $B$, $\Gamma$, and $D$ for which [9.4.4] is
maximized.
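For example, for the system of [9.3.4], the FIML estimates of $\beta$, $\gamma$, $h$, $\sigma_d^2$,
and $\sigma_s^2$ are found by maximizing

$$\mathcal{L}(\beta, \gamma, h, \sigma_d^2, \sigma_s^2) = -T\log(2\pi) + \frac{T}{2}\log\left|\begin{matrix} 1 & -\beta \\ 1 & -\gamma \end{matrix}\right|^2 - \frac{T}{2}\log\left|\begin{matrix} \sigma_d^2 & 0 \\ 0 & \sigma_s^2 \end{matrix}\right|$$

$$\quad - \frac{1}{2}\sum_{t=1}^T \left\{\begin{bmatrix} q_t - \beta p_t & q_t - \gamma p_t - h w_t \end{bmatrix}\begin{bmatrix} \sigma_d^2 & 0 \\ 0 & \sigma_s^2 \end{bmatrix}^{-1}\begin{bmatrix} q_t - \beta p_t \\ q_t - \gamma p_t - h w_t \end{bmatrix}\right\}$$

$$= -T\log(2\pi) + T\log(\gamma - \beta) - (T/2)\log(\sigma_d^2) - (T/2)\log(\sigma_s^2)$$

$$\quad - (1/2)\sum_{t=1}^T (q_t - \beta p_t)^2/\sigma_d^2 - (1/2)\sum_{t=1}^T (q_t - \gamma p_t - h w_t)^2/\sigma_s^2. \qquad [9.4.5]$$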


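The first-order conditions for maximization are

$$\frac{\partial \mathcal{L}}{\partial \beta} = -\frac{T}{\gamma - \beta} + \frac{\sum_{t=1}^T (q_t - \beta p_t) p_t}{\sigma_d^2} = 0 \qquad [9.4.6]$$

$$\frac{\partial \mathcal{L}}{\partial \gamma} = \frac{T}{\gamma - \beta} + \frac{\sum_{t=1}^T (q_t - \gamma p_t - h w_t) p_t}{\sigma_s^2} = 0 \qquad [9.4.7]$$

$$\frac{\partial \mathcal{L}}{\partial h} = \frac{\sum_{t=1}^T (q_t - \gamma p_t - h w_t) w_t}{\sigma_s^2} = 0 \qquad [9.4.8]$$

$$\frac{\partial \mathcal{L}}{\partial \sigma_d^2} = -\frac{T}{2\sigma_d^2} + \frac{\sum_{t=1}^T (q_t - \beta p_t)^2}{2\sigma_d^4} = 0 \qquad [9.4.9]$$

$$\frac{\partial \mathcal{L}}{\partial \sigma_s^2} = -\frac{T}{2\sigma_s^2} + \frac{\sum_{t=1}^T (q_t - \gamma p_t - h w_t)^2}{2\sigma_s^4} = 0. \qquad [9.4.10]$$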


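The last two equations characterize the maximum likelihood estimates of the
variances as the average squared residuals:

$$\hat\sigma_d^2 = (1/T)\sum_{t=1}^T (q_t - \hat\beta p_t)^2 \qquad [9.4.11]$$

$$\hat\sigma_s^2 = (1/T)\sum_{t=1}^T (q_t - \hat\gamma p_t - \hat h w_t)^2. \qquad [9.4.12]$$

Multiplying equation [9.4.7] by $(\hat\beta - \hat\gamma)/T$ results in

$$0 = -1 + \sum_{t=1}^T (q_t - \hat\gamma p_t - \hat h w_t)(\hat\beta p_t - \hat\gamma p_t)/(T\hat\sigma_s^2)$$

$$= -1 + \sum_{t=1}^T (q_t - \hat\gamma p_t - \hat h w_t)(\hat\beta p_t - q_t + q_t - \hat\gamma p_t)/(T\hat\sigma_s^2). \qquad [9.4.13]$$

If [9.4.8] is multiplied by $\hat h/T$ and subtracted from [9.4.13], the result is

$$0 = -1 + \sum_{t=1}^T (q_t - \hat\gamma p_t - \hat h w_t)(\hat\beta p_t - q_t + q_t - \hat\gamma p_t - \hat h w_t)/(T\hat\sigma_s^2)$$

$$= -1 + \sum_{t=1}^T (q_t - \hat\gamma p_t - \hat h w_t)(\hat\beta p_t - q_t)/(T\hat\sigma_s^2) + \sum_{t=1}^T (q_t - \hat\gamma p_t - \hat h w_t)^2/(T\hat\sigma_s^2)$$

$$= -1 - \sum_{t=1}^T (q_t - \hat\gamma p_t - \hat h w_t)(q_t - \hat\beta p_t)/(T\hat\sigma_s^2) + 1,$$

by virtue of [9.4.12]. Hence, the MLEs satisfy

$$\sum_{t=1}^T (q_t - \hat\gamma p_t - \hat h w_t)(q_t - \hat\beta p_t) = 0. \qquad [9.4.14]$$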


Similarly, multiplying [9.4.6] by $(\hat\gamma - \hat\beta)/T$,

$$0 = -1 + \sum_{t=1}^T (q_t - \hat\beta p_t)(\hat\gamma p_t - q_t + q_t - \hat\beta p_t)/(T\hat\sigma_d^2)$$

$$= -1 - \sum_{t=1}^T (q_t - \hat\beta p_t)(q_t - \hat\gamma p_t)/(T\hat\sigma_d^2) + \sum_{t=1}^T (q_t - \hat\beta p_t)^2/(T\hat\sigma_d^2).$$

Using [9.4.11],

$$\sum_{t=1}^T (q_t - \hat\beta p_t)(q_t - \hat\gamma p_t) = 0. \qquad [9.4.15]$$

Subtracting [9.4.14] from [9.4.15],

$$0 = \sum_{t=1}^T (q_t - \hat\beta p_t)[(q_t - \hat\gamma p_t) - (q_t - \hat\gamma p_t - \hat h w_t)] = \hat h \sum_{t=1}^T (q_t - \hat\beta p_t) w_t.$$

Assuming that $\hat h \neq 0$, the FIML estimate of $\beta$ thus satisfies

$$\sum_{t=1}^T (q_t - \hat\beta p_t) w_t = 0;$$

that is, the demand elasticity is chosen so as to make the estimated residual for
the demand equation orthogonal to $w_t$. Hence, the instrumental variable estimator
$\hat\beta_{IV}$ turns out also to be the FIML estimator. Equations [9.4.8] and [9.4.14] assert
that the parameters for the supply equation ($\gamma$ and $h$) are chosen so as to make
the residual for that equation orthogonal to $w_t$ and to the demand residual $\hat u_t^d =
q_t - \hat\beta p_t$. Hence, the FIML estimates for these parameters are the same as the
instrumental-variable estimates suggested in [9.3.14].
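
The claim that the instrumental-variable estimates solve the FIML first-order conditions can be checked numerically. The Python sketch below computes the IV estimates on simulated data and then evaluates the left sides of [9.4.6] through [9.4.8] at those values, using the average squared residuals [9.4.11] and [9.4.12] for the variances. The simulation design, the helper names, and the parameter values are assumptions introduced for this illustration.

```python
import random

def sdot(a, b):
    """Sum over t of a_t * b_t."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical supply-and-demand data (an assumption of this sketch).
rng = random.Random(4)
T, beta, gamma, h = 2000, -1.0, 1.0, 1.0
q, p, w = [], [], []
for _ in range(T):
    w_t, u_d, u_s = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
    p_t = (h * w_t + u_s - u_d) / (beta - gamma)
    q.append(beta * p_t + u_d)
    p.append(p_t)
    w.append(w_t)

# IV estimate of the demand equation, then the estimator [9.3.14] for supply.
beta_hat = sdot(w, q) / sdot(w, p)
u_d = [a - beta_hat * b for a, b in zip(q, p)]
a11, a12, a21, a22 = sdot(u_d, p), sdot(u_d, w), sdot(w, p), sdot(w, w)
b1, b2 = sdot(u_d, q), sdot(w, q)
det = a11 * a22 - a12 * a21
gamma_hat = (b1 * a22 - a12 * b2) / det
h_hat = (a11 * b2 - a21 * b1) / det
u_s = [a - gamma_hat * b - h_hat * c for a, b, c in zip(q, p, w)]

# Variance estimates [9.4.11] and [9.4.12].
s2_d = sdot(u_d, u_d) / T
s2_s = sdot(u_s, u_s) / T

# Left sides of the first-order conditions [9.4.6], [9.4.7], and [9.4.8].
score_beta = -T / (gamma_hat - beta_hat) + sdot(u_d, p) / s2_d
score_gamma = T / (gamma_hat - beta_hat) + sdot(u_s, p) / s2_s
score_h = sdot(u_s, w) / s2_s
print(score_beta, score_gamma, score_h)
```

All three scores vanish up to floating-point rounding, as the algebra in the text implies for this just-identified system.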

For this example, two-stage least squares, instrumental variable estimation,
and full-information maximum likelihood all produced the identical estimates. This
is because the model is just identified. A model is said to be just identified if for
any admissible value for the parameters of the reduced-form representation there
exists a unique value for the structural parameters that implies those reduced-form
parameters. A model is said to be overidentified if some admissible values for the
reduced-form parameters are ruled out by the structural restrictions. In an
overidentified model, IV, 2SLS, and FIML estimation are not equivalent, and FIML
typically produces the most efficient estimates.

For a general overidentified simultaneous equation system with no restrictions
on the variance-covariance matrix, the FIML estimates can be calculated by
iterating on a procedure known as three-stage least squares; see, for example, Maddala
(1977, pp. 482-90). Rothenberg and Ruud (1990) discussed FIML estimation in
the presence of covariance restrictions. FIML estimation of dynamic time series
models will be discussed further in Chapter 11.


9.5. Estimation Based on the Reduced Form


If a system is just identified as in [9.3.2] and [9.3.3] with $u_t^d$ uncorrelated with $u_t^s$,
one approach is to maximize the likelihood function with respect to the
reduced-form parameters. The values of the structural parameters associated with these
values for the reduced-form parameters are the same as the FIML estimates in a
just-identified model.


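The log likelihood [9.4.1] can be expressed in terms of the reduced-form
parameters $\Pi$ and $\Omega$ as

$$\mathcal{L}(\Pi, \Omega) = \sum_{t=1}^T \log f(y_t | x_t; \Pi, \Omega)$$

$$= -(Tn/2)\log(2\pi) - (T/2)\log|\Omega| - (1/2)\sum_{t=1}^T [y_t - \Pi' x_t]'\Omega^{-1}[y_t - \Pi' x_t], \qquad [9.5.1]$$

where $\Omega = E(v_t v_t') = B^{-1}D(B^{-1})'$. The value of $\Pi$ that maximizes [9.5.1] will be
shown in Chapter 11 to be given by

$$\hat\Pi' = \left[\sum_{t=1}^T y_t x_t'\right]\left[\sum_{t=1}^T x_t x_t'\right]^{-1};$$

in other words, the $i$th row of $\hat\Pi'$ is obtained from an OLS regression of the $i$th
endogenous variable on all of the predetermined variables:

$$\hat\pi_i' = \left[\sum_{t=1}^T y_{it} x_t'\right]\left[\sum_{t=1}^T x_t x_t'\right]^{-1}.$$

The MLE of $\Omega$ turns out to be

$$\hat\Omega = (1/T)\left[\sum_{t=1}^T (y_t - \hat\Pi' x_t)(y_t - \hat\Pi' x_t)'\right].$$

For a just-identified model, the FIML estimates are the values of $(\hat B, \hat\Gamma, \hat D)$ for
which $\hat\Pi' = -\hat B^{-1}\hat\Gamma$ and $\hat\Omega = \hat B^{-1}\hat D(\hat B^{-1})'$.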

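We now show that the estimates of $B$, $\Gamma$, and $D$ inferred in this fashion from
the reduced-form parameters for the just-identified supply-and-demand example
are the same as the FIML estimates. The estimate $\hat\pi_1$ is found by OLS regression
of $q_t$ on $w_t$, while $\hat\pi_2$ is the coefficient from an OLS regression of $p_t$ on $w_t$. These
estimates satisfy

$$\sum_{t=1}^T (q_t - \hat\pi_1 w_t) w_t = 0 \qquad [9.5.2]$$

$$\sum_{t=1}^T (p_t - \hat\pi_2 w_t) w_t = 0 \qquad [9.5.3]$$

and

$$\begin{bmatrix} \hat\Omega_{11} & \hat\Omega_{12} \\ \hat\Omega_{21} & \hat\Omega_{22} \end{bmatrix} = (1/T)\begin{bmatrix} \Sigma (q_t - \hat\pi_1 w_t)^2 & \Sigma (q_t - \hat\pi_1 w_t)(p_t - \hat\pi_2 w_t) \\ \Sigma (p_t - \hat\pi_2 w_t)(q_t - \hat\pi_1 w_t) & \Sigma (p_t - \hat\pi_2 w_t)^2 \end{bmatrix}. \qquad [9.5.4]$$

The structural estimates satisfy $\hat B\hat\Pi' = -\hat\Gamma$ or

$$\begin{bmatrix} 1 & -\hat\beta \\ 1 & -\hat\gamma \end{bmatrix}\begin{bmatrix} \hat\pi_1 \\ \hat\pi_2 \end{bmatrix} = \begin{bmatrix} 0 \\ \hat h \end{bmatrix}. \qquad [9.5.5]$$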


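Multiplying [9.5.3] by $\hat\beta$ and subtracting the result from [9.5.2] produces

$$0 = \sum_{t=1}^T (q_t - \hat\pi_1 w_t - \hat\beta p_t + \hat\beta\hat\pi_2 w_t) w_t$$

$$= \sum_{t=1}^T (q_t - \hat\beta p_t) w_t - \sum_{t=1}^T (\hat\pi_1 - \hat\beta\hat\pi_2) w_t^2$$

$$= \sum_{t=1}^T (q_t - \hat\beta p_t) w_t,$$

by virtue of the first row of [9.5.5]. Thus, the estimate of $\beta$ inferred from the
reduced-form parameters is the same as the IV or FIML estimate derived earlier.
Similarly, multiplying [9.5.3] by $\hat\gamma$ and subtracting the result from [9.5.2] gives

$$0 = \sum_{t=1}^T (q_t - \hat\pi_1 w_t - \hat\gamma p_t + \hat\gamma\hat\pi_2 w_t) w_t$$

$$= \sum_{t=1}^T [q_t - \hat\gamma p_t - (\hat\pi_1 - \hat\gamma\hat\pi_2) w_t] w_t$$

$$= \sum_{t=1}^T [q_t - \hat\gamma p_t - \hat h w_t] w_t,$$

by virtue of the second row of [9.5.5], reproducing the first-order condition [9.4.8]
for FIML. Finally, we need to solve $\hat D = \hat B\hat\Omega\hat B'$ for $\hat D$ and $\hat\gamma$ (the remaining element
of $\hat B$). These equations are

$$\begin{bmatrix} \hat\sigma_d^2 & 0 \\ 0 & \hat\sigma_s^2 \end{bmatrix} = \begin{bmatrix} 1 & -\hat\beta \\ 1 & -\hat\gamma \end{bmatrix}\begin{bmatrix} \hat\Omega_{11} & \hat\Omega_{12} \\ \hat\Omega_{21} & \hat\Omega_{22} \end{bmatrix}\begin{bmatrix} 1 & 1 \\ -\hat\beta & -\hat\gamma \end{bmatrix}$$

$$= \frac{1}{T}\sum_{t=1}^T \left\{\begin{bmatrix} 1 & -\hat\beta \\ 1 & -\hat\gamma \end{bmatrix}\begin{bmatrix} q_t - \hat\pi_1 w_t \\ p_t - \hat\pi_2 w_t \end{bmatrix}\begin{bmatrix} q_t - \hat\pi_1 w_t & p_t - \hat\pi_2 w_t \end{bmatrix}\begin{bmatrix} 1 & 1 \\ -\hat\beta & -\hat\gamma \end{bmatrix}\right\}$$

$$= \frac{1}{T}\sum_{t=1}^T \left\{\begin{bmatrix} q_t - \hat\beta p_t - (\hat\pi_1 - \hat\beta\hat\pi_2) w_t \\ q_t - \hat\gamma p_t - (\hat\pi_1 - \hat\gamma\hat\pi_2) w_t \end{bmatrix}\begin{bmatrix} q_t - \hat\beta p_t - (\hat\pi_1 - \hat\beta\hat\pi_2) w_t & q_t - \hat\gamma p_t - (\hat\pi_1 - \hat\gamma\hat\pi_2) w_t \end{bmatrix}\right\}$$

$$= \frac{1}{T}\sum_{t=1}^T \left\{\begin{bmatrix} q_t - \hat\beta p_t \\ q_t - \hat\gamma p_t - \hat h w_t \end{bmatrix}\begin{bmatrix} q_t - \hat\beta p_t & q_t - \hat\gamma p_t - \hat h w_t \end{bmatrix}\right\}.$$

The diagonal elements of this matrix system of equations reproduce the earlier
formulas for the FIML estimates of the variance parameters, while the off-diagonal
element reproduces the result [9.4.14].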
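
For the demand elasticity, the mapping from reduced-form to structural estimates is especially simple: the first row of [9.5.5] gives $\hat\beta = \hat\pi_1/\hat\pi_2$. The Python sketch below confirms on simulated data that this ratio coincides with the IV estimate. The simulated design and all parameter values are assumptions made for illustration only.

```python
import random

# Hypothetical supply-and-demand data (an assumption of this sketch).
rng = random.Random(5)
T, beta, gamma, h = 3000, -1.0, 1.0, 1.0
q, p, w = [], [], []
for _ in range(T):
    w_t, u_d, u_s = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
    p_t = (h * w_t + u_s - u_d) / (beta - gamma)
    q.append(beta * p_t + u_d)
    p.append(p_t)
    w.append(w_t)

sww = sum(a * a for a in w)
pi1_hat = sum(a * b for a, b in zip(w, q)) / sww   # OLS of q_t on w_t, as in [9.5.2]
pi2_hat = sum(a * b for a, b in zip(w, p)) / sww   # OLS of p_t on w_t, as in [9.5.3]

# First row of [9.5.5]: pi1_hat - beta_hat * pi2_hat = 0.
beta_from_reduced_form = pi1_hat / pi2_hat
# Direct IV estimate of the demand equation with w_t as instrument.
beta_iv = sum(a * b for a, b in zip(w, q)) / sum(a * b for a, b in zip(w, p))
print(beta_from_reduced_form, beta_iv)
```

The two numbers agree up to rounding because the common factor $\Sigma w_t^2$ cancels in the ratio of the two OLS coefficients.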


9.6. Overview of Simultaneous Equations Bias 


The problem of simultaneous equations bias is extremely widespread in the social 
sciences. It is rare that the relation that we are interested in estimating is the only 
possible reason why the dependent and explanatory variables might be correlated. 
For example, consider trying to estimate the effect of military service on an indi- 
vidual’s subsequent income. This parameter cannot be estimated by a regression 
of income on a measure of military service and other observed variables. The error 




term in such a regression represents other characteristics of the individual that 
influence income, and these omitted factors are also likely to have influenced the 
individual’s military participation. As another example, consider trying to estimate 
the success of long prison sentences in deterring crime. This cannot be estimated 
by regressing the crime rate in a state on the average prison term in that state, 
because some states may have adopted stiffer prison sentences in response to higher 
crime. The error term in the regression, which represents other factors that influ- 
ence crime, is thus likely also to be correlated with the explanatory variable. 
Regardless of whether the researcher is interested in the factors that determine 
military service or prison terms or has any theory about them, simultaneous equa- 
tions bias must be recognized and dealt with. 

Furthermore, it is not enough to find an instrument $x_t$ that is uncorrelated
with the residual $u_t$. In order to satisfy the rank condition, the instrument $x_t$ must
be correlated with the endogenous explanatory variables $z_t$. The calculations by
Nelson and Startz (1990) suggest that very poor estimates can result if $x_t$ is only
weakly correlated with $z_t$.

Finding valid instruments is often extremely difficult and requires careful 
thought and a bit of good luck. For the question about military service, Angrist 
(1990) found an ingenious instrument for military service based on the institutional 
details of the draft in the United States during the Vietnam War. The likelihood 
that an individual was drafted into military service was determined by a lottery 
based on birthdays. Thus, an individual’s birthday during the year would be cor- 
related with military service but presumably uncorrelated with other factors influ- 
encing income. Unfortunately, it is unusual to be able to find such a compelling 
instrument for many questions that one would like to ask of the data. 


APPENDIX 9.A. Proofs of Chapter 9 Proposition 


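■ Proof of Proposition 9.1. We first show that (a) implies (c). The middle block of [9.3.10]
states that

$$y_{1t} = \Pi_{11} x_{1t} + \Pi_{12} x_{2t} + v_{1t}.$$

Hence,

$$M = E\left\{\begin{bmatrix} x_{1t} \\ y_{1t} \end{bmatrix}\begin{bmatrix} x_{1t}' & x_{2t}' \end{bmatrix}\right\} = E\left\{\begin{bmatrix} I_{m_1} & 0 \\ \Pi_{11} & \Pi_{12} \end{bmatrix}\begin{bmatrix} x_{1t} \\ x_{2t} \end{bmatrix}\begin{bmatrix} x_{1t}' & x_{2t}' \end{bmatrix}\right\} + E\left\{\begin{bmatrix} 0 \\ v_{1t} \end{bmatrix}\begin{bmatrix} x_{1t}' & x_{2t}' \end{bmatrix}\right\} = \begin{bmatrix} I_{m_1} & 0 \\ \Pi_{11} & \Pi_{12} \end{bmatrix} E(x_t x_t'), \qquad [9.A.1]$$

since $x_t$ is uncorrelated with $u_t$ and thus uncorrelated with $v_t$.

Suppose that the rows of $M$ are linearly independent. This means that
$[\lambda' \;\; \mu']M \neq 0'$ for any $(m_1 \times 1)$ vector $\lambda$ and any $(n_1 \times 1)$ vector $\mu$ that are not both
zero. In particular, $[-\mu'\Pi_{11} \;\; \mu']M \neq 0'$. But from the right side of [9.A.1], this implies
that

$$[-\mu'\Pi_{11} \;\; \mu']\begin{bmatrix} I_{m_1} & 0 \\ \Pi_{11} & \Pi_{12} \end{bmatrix} E(x_t x_t') = [0' \;\; \mu'\Pi_{12}]\, E(x_t x_t') \neq 0'$$

for any nonzero $(n_1 \times 1)$ vector $\mu$. But this could be true only if $\mu'\Pi_{12} \neq 0'$. Hence, if the
rows of $M$ are linearly independent, then the rows of $\Pi_{12}$ are also linearly independent.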


To prove that (c) implies (a), premultiply both sides of [9.A.1] by any nonzero vector
$[\lambda' \;\; \mu']$. The right side becomes

$$[\lambda' \;\; \mu']\begin{bmatrix} I_{m_1} & 0 \\ \Pi_{11} & \Pi_{12} \end{bmatrix} E(x_t x_t') = [(\lambda' + \mu'\Pi_{11}) \;\; \mu'\Pi_{12}]\, E(x_t x_t') = \eta' E(x_t x_t'),$$

where $\eta' = [(\lambda' + \mu'\Pi_{11}) \;\; \mu'\Pi_{12}]$. If the rows of $\Pi_{12}$ are linearly independent, then $\eta'$
cannot be the zero vector unless both $\mu$ and $\lambda$ are zero. To see this, note that if $\mu$ is nonzero,
then $\mu'\Pi_{12}$ cannot be the zero vector, while if $\mu = 0$, then $\eta$ will be zero only if $\lambda$ is also
the zero vector. Furthermore, since $E(x_t x_t')$ is nonsingular, a nonzero $\eta$ means that
$\eta' E(x_t x_t')$ cannot be the zero vector. Thus, if the right side of [9.A.1] is premultiplied by
any nonzero vector $(\lambda', \mu')$, the result is not zero. The same must be true of the left side:
$[\lambda' \;\; \mu']M \neq 0'$ for any nonzero $(\lambda', \mu')$, establishing that linear independence of the rows
of $\Pi_{12}$ implies linear independence of the rows of $M$.
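To see that (b) implies (c), write [9.3.7] as

$$\begin{bmatrix} \Pi_{01} & \Pi_{02} \\ \Pi_{11} & \Pi_{12} \\ \Pi_{21} & \Pi_{22} \end{bmatrix} = -B^{-1}\begin{bmatrix} \Gamma_{01} & 0' \\ \Gamma_{11} & \Gamma_{12} \\ \Gamma_{21} & \Gamma_{22} \end{bmatrix}. \qquad [9.A.2]$$

We also have the identity

$$\begin{bmatrix} 1 & 0' & 0' \\ 0 & I_{n_1} & 0 \\ 0 & 0 & I_{n_2} \end{bmatrix} = B^{-1}\begin{bmatrix} 1 & B_{01} & 0' \\ B_{10} & B_{11} & B_{12} \\ B_{20} & B_{21} & B_{22} \end{bmatrix}. \qquad [9.A.3]$$

The system of equations represented by the second block column of [9.A.2] and the third
block column of [9.A.3] can be collected as

$$\begin{bmatrix} \Pi_{02} & 0' \\ \Pi_{12} & 0 \\ \Pi_{22} & I_{n_2} \end{bmatrix} = B^{-1}\begin{bmatrix} 0' & 0' \\ -\Gamma_{12} & B_{12} \\ -\Gamma_{22} & B_{22} \end{bmatrix}. \qquad [9.A.4]$$

If both sides of [9.A.4] are premultiplied by the row vector $[0 \;\; \mu_1' \;\; 0']$ where $\mu_1$ is
any $(n_1 \times 1)$ vector, the result is

$$[\mu_1'\Pi_{12} \;\; 0'] = [0 \;\; \mu_1' \;\; 0']\,B^{-1}\begin{bmatrix} 0' & 0' \\ -\Gamma_{12} & B_{12} \\ -\Gamma_{22} & B_{22} \end{bmatrix} = [\lambda_0 \;\; \lambda_1' \;\; \lambda_2']\begin{bmatrix} 0' & 0' \\ -\Gamma_{12} & B_{12} \\ -\Gamma_{22} & B_{22} \end{bmatrix}, \qquad [9.A.5]$$

where

$$[\lambda_0 \;\; \lambda_1' \;\; \lambda_2'] \equiv [0 \;\; \mu_1' \;\; 0']\,B^{-1},$$

implying

$$[0 \;\; \mu_1' \;\; 0'] = [\lambda_0 \;\; \lambda_1' \;\; \lambda_2']\,B. \qquad [9.A.6]$$

Suppose that the rows of the matrix $\begin{bmatrix} \Gamma_{12} & B_{12} \\ \Gamma_{22} & B_{22} \end{bmatrix}$ are linearly independent. Then the only
values for $\lambda_1$ and $\lambda_2$ for which the right side of [9.A.5] can be zero are $\lambda_1 = 0$ and $\lambda_2 =
0$. Substituting these values into [9.A.6], the only value of $\mu_1$ for which the left side of
[9.A.5] can be zero must satisfy

$$[0 \;\; \mu_1' \;\; 0'] = [\lambda_0 \;\; 0' \;\; 0']\,B = [\lambda_0 \;\; \lambda_0 B_{01} \;\; 0'].$$

Matching the first elements in these vectors implies $\lambda_0 = 0$, and thus matching the second
elements requires $\mu_1 = 0$. Thus, if condition (b) is satisfied, then the only value of $\mu_1$ for
which the left side of [9.A.5] can be zero is $\mu_1 = 0$, establishing that the rows of $\Pi_{12}$ are
linearly independent. Hence, condition (c) is satisfied whenever (b) holds.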

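Conversely, to see that (c) implies (b), let $\boldsymbol{\lambda}_1$ and $\boldsymbol{\lambda}_2$ denote any $(n_1 \times 1)$ and $(n_2 \times 1)$
vectors, and premultiply both sides of [9.A.4] by the row vector $[0 \;\; \boldsymbol{\lambda}_1' \;\; \boldsymbol{\lambda}_2']\mathbf{B}$:

$$
[0 \;\; \boldsymbol{\lambda}_1' \;\; \boldsymbol{\lambda}_2']\,\mathbf{B}
\begin{bmatrix} \boldsymbol{\Pi}_{02} & \mathbf{0}' \\ \boldsymbol{\Pi}_{12} & \mathbf{0} \\ \boldsymbol{\Pi}_{22} & \mathbf{I}_{n_2} \end{bmatrix}
= [0 \;\; \boldsymbol{\lambda}_1' \;\; \boldsymbol{\lambda}_2']
\begin{bmatrix} \mathbf{0}' & \mathbf{0}' \\ -\boldsymbol{\Gamma}_{12} & \mathbf{B}_{12} \\ -\boldsymbol{\Gamma}_{22} & \mathbf{B}_{22} \end{bmatrix}
$$

or

$$
[\mu_0 \;\; \boldsymbol{\mu}_1' \;\; \boldsymbol{\mu}_2']
\begin{bmatrix} \boldsymbol{\Pi}_{02} & \mathbf{0}' \\ \boldsymbol{\Pi}_{12} & \mathbf{0} \\ \boldsymbol{\Pi}_{22} & \mathbf{I}_{n_2} \end{bmatrix}
= [\boldsymbol{\lambda}_1' \;\; \boldsymbol{\lambda}_2']
\begin{bmatrix} -\boldsymbol{\Gamma}_{12} & \mathbf{B}_{12} \\ -\boldsymbol{\Gamma}_{22} & \mathbf{B}_{22} \end{bmatrix},
\tag{9.A.7}
$$

where

$$
[\mu_0 \;\; \boldsymbol{\mu}_1' \;\; \boldsymbol{\mu}_2'] \equiv [0 \;\; \boldsymbol{\lambda}_1' \;\; \boldsymbol{\lambda}_2']\,\mathbf{B}.
\tag{9.A.8}
$$

Premultiplying both sides of equation [9.A.4] by $\mathbf{B}$ implies that

$$
\begin{bmatrix} 1 & \mathbf{B}_{01} & \mathbf{0}' \\ \mathbf{B}_{10} & \mathbf{B}_{11} & \mathbf{B}_{12} \\ \mathbf{B}_{20} & \mathbf{B}_{21} & \mathbf{B}_{22} \end{bmatrix}
\begin{bmatrix} \boldsymbol{\Pi}_{02} & \mathbf{0}' \\ \boldsymbol{\Pi}_{12} & \mathbf{0} \\ \boldsymbol{\Pi}_{22} & \mathbf{I}_{n_2} \end{bmatrix}
= \begin{bmatrix} \mathbf{0}' & \mathbf{0}' \\ -\boldsymbol{\Gamma}_{12} & \mathbf{B}_{12} \\ -\boldsymbol{\Gamma}_{22} & \mathbf{B}_{22} \end{bmatrix}.
$$

The upper left element of this matrix system asserts that

$$
\boldsymbol{\Pi}_{02} + \mathbf{B}_{01}\boldsymbol{\Pi}_{12} = \mathbf{0}'.
\tag{9.A.9}
$$

Substituting [9.A.9] into [9.A.7],

$$
[\mu_0 \;\; \boldsymbol{\mu}_1' \;\; \boldsymbol{\mu}_2']
\begin{bmatrix} -\mathbf{B}_{01}\boldsymbol{\Pi}_{12} & \mathbf{0}' \\ \boldsymbol{\Pi}_{12} & \mathbf{0} \\ \boldsymbol{\Pi}_{22} & \mathbf{I}_{n_2} \end{bmatrix}
= [\boldsymbol{\lambda}_1' \;\; \boldsymbol{\lambda}_2']
\begin{bmatrix} -\boldsymbol{\Gamma}_{12} & \mathbf{B}_{12} \\ -\boldsymbol{\Gamma}_{22} & \mathbf{B}_{22} \end{bmatrix}.
\tag{9.A.10}
$$

In order for the left side of [9.A.10] to be zero, it must be the case that $\boldsymbol{\mu}_2 = \mathbf{0}$ and that

$$
-\mu_0\mathbf{B}_{01}\boldsymbol{\Pi}_{12} + \boldsymbol{\mu}_1'\boldsymbol{\Pi}_{12}
= (\boldsymbol{\mu}_1' - \mu_0\mathbf{B}_{01})\boldsymbol{\Pi}_{12} = \mathbf{0}'.
\tag{9.A.11}
$$

But if the rows of $\boldsymbol{\Pi}_{12}$ are linearly independent, [9.A.11] can be zero only if

$$
\boldsymbol{\mu}_1' = \mu_0\mathbf{B}_{01}.
\tag{9.A.12}
$$

Substituting these results into [9.A.8], it follows that [9.A.10] can be zero only if

$$
\begin{aligned}
[0 \;\; \boldsymbol{\lambda}_1' \;\; \boldsymbol{\lambda}_2']\,\mathbf{B}
&= [\mu_0 \;\; \mu_0\mathbf{B}_{01} \;\; \mathbf{0}'] \\
&= [\mu_0 \;\; \mathbf{0}' \;\; \mathbf{0}']
\begin{bmatrix} 1 & \mathbf{B}_{01} & \mathbf{0}' \\ \mathbf{B}_{10} & \mathbf{B}_{11} & \mathbf{B}_{12} \\ \mathbf{B}_{20} & \mathbf{B}_{21} & \mathbf{B}_{22} \end{bmatrix} \\
&= [\mu_0 \;\; \mathbf{0}' \;\; \mathbf{0}']\,\mathbf{B}.
\end{aligned}
\tag{9.A.13}
$$

Since $\mathbf{B}$ is nonsingular, both sides of [9.A.13] can be postmultiplied by $\mathbf{B}^{-1}$ to deduce that
[9.A.10] can be zero only if

$$
[0 \;\; \boldsymbol{\lambda}_1' \;\; \boldsymbol{\lambda}_2'] = [\mu_0 \;\; \mathbf{0}' \;\; \mathbf{0}'].
$$

Thus, the right side of [9.A.10] can be zero only if $\boldsymbol{\lambda}_1$ and $\boldsymbol{\lambda}_2$ are both zero, establishing
that the rows of the matrix in [9.3.12] must be linearly independent. ∎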


Chapter 9 Exercise 
9.1. Verify that [9.2.23] gives a consistent estimate of $\sigma^2$.




Chapter 9 References 


Angrist, Joshua D. 1990. “Lifetime Earnings and the Vietnam Era Draft Lottery: Evidence
from Social Security Administration Records.” American Economic Review 80:313-36.
Errata, 1990, 80:1284-86.

Fisher, Franklin M. 1966. The Identification Problem in Econometrics. New York: McGraw-Hill.

Hausman, Jerry A., and William E. Taylor. 1983. “Identification in Linear Simultaneous
Equations Models with Covariance Restrictions: An Instrumental Variables Interpretation.”
Econometrica 51:1527-49.

Maddala, G. S. 1977. Econometrics. New York: McGraw-Hill.

Nelson, Charles R., and Richard Startz. 1990. “Some Further Results on the Exact Small
Sample Properties of the Instrumental Variable Estimator.” Econometrica 58:967-76.

Rothenberg, Thomas J. 1971. “Identification in Parametric Models.” Econometrica 39:577-91.

Rothenberg, Thomas J., and Paul A. Ruud. 1990. “Simultaneous Equations with Covariance Restrictions.”
Journal of Econometrics 44:25-39.




10 


Covariance-Stationary 
Vector Processes 


This is the first of two chapters introducing vector time series. Chapter 10 is devoted 
to the theory of multivariate dynamic systems, while Chapter 11 focuses on em- 
pirical issues of estimating and interpreting vector autoregressions. Only the first 
section of Chapter 10 is necessary for understanding the material in Chapter 11.

Section 10.1 introduces some of the key ideas in vector time series analysis. 
Section 10.2 develops some convergence results that are useful for deriving the 
asymptotic properties of certain statistics and for characterizing the consequences 
of multivariate filters. Section 10.3 introduces the autocovariance-generating func- 
tion for vector processes, which is used to analyze the multivariate spectrum in 
Section 10.4. Section 10.5 develops a multivariate generalization of Proposition 
7.5, describing the asymptotic properties of the sample mean of a serially correlated 
vector process. These last results are useful for deriving autocorrelation- and het- 
eroskedasticity-consistent estimators for OLS, for understanding the properties of 
generalized method of moments estimators discussed in Chapter 14, and for deriving 
some of the tests for unit roots discussed in Chapter 17. 


10.1. Introduction to Vector Autoregressions 
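Chapter 3 proposed modeling a scalar time series $y_t$ in terms of an autoregression:

$$
y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t,
\tag{10.1.1}
$$

where

$$
E(\varepsilon_t) = 0
\tag{10.1.2}
$$

$$
E(\varepsilon_t\varepsilon_\tau) =
\begin{cases} \sigma^2 & \text{for } t = \tau \\ 0 & \text{otherwise.} \end{cases}
\tag{10.1.3}
$$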


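Note that we will continue to use the convention introduced in Chapter 8 of using
lowercase letters to denote either a random variable or its realization. This chapter
describes the dynamic interactions among a set of variables collected in an $(n \times 1)$
vector $\mathbf{y}_t$. For example, the first element of $\mathbf{y}_t$ (denoted $y_{1t}$) might represent the
level of GNP in year $t$, the second element ($y_{2t}$) the interest rate paid on Treasury
bills in year $t$, and so on. A $p$th-order vector autoregression, denoted VAR($p$), is
a vector generalization of [10.1.1] through [10.1.3]:

$$
\mathbf{y}_t = \mathbf{c} + \boldsymbol{\Phi}_1\mathbf{y}_{t-1} + \boldsymbol{\Phi}_2\mathbf{y}_{t-2} + \cdots + \boldsymbol{\Phi}_p\mathbf{y}_{t-p} + \boldsymbol{\varepsilon}_t.
\tag{10.1.4}
$$

Here $\mathbf{c}$ denotes an $(n \times 1)$ vector of constants and $\boldsymbol{\Phi}_j$ an $(n \times n)$ matrix of
autoregressive coefficients for $j = 1, 2, \ldots, p$. The $(n \times 1)$ vector $\boldsymbol{\varepsilon}_t$ is a vector
generalization of white noise:

$$
E(\boldsymbol{\varepsilon}_t) = \mathbf{0}
\tag{10.1.5}
$$

$$
E(\boldsymbol{\varepsilon}_t\boldsymbol{\varepsilon}_\tau') =
\begin{cases} \boldsymbol{\Omega} & \text{for } t = \tau \\ \mathbf{0} & \text{otherwise,} \end{cases}
\tag{10.1.6}
$$

with $\boldsymbol{\Omega}$ an $(n \times n)$ symmetric positive definite matrix.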

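Let $c_i$ denote the $i$th element of the vector $\mathbf{c}$ and let $\phi_{ij}^{(1)}$ denote the row $i$,
column $j$ element of the matrix $\boldsymbol{\Phi}_1$. Then the first row of the vector system in
[10.1.4] specifies that

$$
\begin{aligned}
y_{1t} = c_1 &+ \phi_{11}^{(1)} y_{1,t-1} + \phi_{12}^{(1)} y_{2,t-1} + \cdots + \phi_{1n}^{(1)} y_{n,t-1} \\
&+ \phi_{11}^{(2)} y_{1,t-2} + \phi_{12}^{(2)} y_{2,t-2} + \cdots + \phi_{1n}^{(2)} y_{n,t-2} + \cdots \\
&+ \phi_{11}^{(p)} y_{1,t-p} + \phi_{12}^{(p)} y_{2,t-p} + \cdots + \phi_{1n}^{(p)} y_{n,t-p} + \varepsilon_{1t}.
\end{aligned}
\tag{10.1.7}
$$

Thus, a vector autoregression is a system in which each variable is regressed on a
constant and $p$ of its own lags as well as on $p$ lags of each of the other variables
in the VAR. Note that each regression has the same explanatory variables.

Using lag operator notation, [10.1.4] can be written in the form

$$
[\mathbf{I}_n - \boldsymbol{\Phi}_1 L - \boldsymbol{\Phi}_2 L^2 - \cdots - \boldsymbol{\Phi}_p L^p]\,\mathbf{y}_t = \mathbf{c} + \boldsymbol{\varepsilon}_t
$$

or

$$
\boldsymbol{\Phi}(L)\mathbf{y}_t = \mathbf{c} + \boldsymbol{\varepsilon}_t.
$$

Here $\boldsymbol{\Phi}(L)$ indicates an $(n \times n)$ matrix polynomial in the lag operator $L$. The row
$i$, column $j$ element of $\boldsymbol{\Phi}(L)$ is a scalar polynomial in $L$:

$$
\boldsymbol{\Phi}(L) = \left[\delta_{ij} - \phi_{ij}^{(1)} L - \phi_{ij}^{(2)} L^2 - \cdots - \phi_{ij}^{(p)} L^p\right],
$$

where $\delta_{ij}$ is unity if $i = j$ and zero otherwise.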

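A vector process $\mathbf{y}_t$ is said to be covariance-stationary if its first and second
moments ($E[\mathbf{y}_t]$ and $E[\mathbf{y}_t\mathbf{y}_{t-j}']$, respectively) are independent of the date $t$. If the
process is covariance-stationary, we can take expectations of both sides of [10.1.4]
to calculate the mean $\boldsymbol{\mu}$ of the process:

$$
\boldsymbol{\mu} = \mathbf{c} + \boldsymbol{\Phi}_1\boldsymbol{\mu} + \boldsymbol{\Phi}_2\boldsymbol{\mu} + \cdots + \boldsymbol{\Phi}_p\boldsymbol{\mu},
$$

or

$$
\boldsymbol{\mu} = (\mathbf{I}_n - \boldsymbol{\Phi}_1 - \boldsymbol{\Phi}_2 - \cdots - \boldsymbol{\Phi}_p)^{-1}\mathbf{c}.
$$

Equation [10.1.4] can then be written in terms of deviations from the mean as

$$
(\mathbf{y}_t - \boldsymbol{\mu}) = \boldsymbol{\Phi}_1(\mathbf{y}_{t-1} - \boldsymbol{\mu}) + \boldsymbol{\Phi}_2(\mathbf{y}_{t-2} - \boldsymbol{\mu}) + \cdots + \boldsymbol{\Phi}_p(\mathbf{y}_{t-p} - \boldsymbol{\mu}) + \boldsymbol{\varepsilon}_t.
\tag{10.1.8}
$$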


Rewriting a VAR(p) as a VAR(1) 


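As in the case of the univariate AR($p$) process, it is helpful to rewrite [10.1.8]
in terms of a VAR(1) process. Toward this end, define

$$
\underset{(np \times 1)}{\boldsymbol{\xi}_t} \equiv
\begin{bmatrix} \mathbf{y}_t - \boldsymbol{\mu} \\ \mathbf{y}_{t-1} - \boldsymbol{\mu} \\ \vdots \\ \mathbf{y}_{t-p+1} - \boldsymbol{\mu} \end{bmatrix}
\tag{10.1.9}
$$

$$
\underset{(np \times np)}{\mathbf{F}} \equiv
\begin{bmatrix}
\boldsymbol{\Phi}_1 & \boldsymbol{\Phi}_2 & \boldsymbol{\Phi}_3 & \cdots & \boldsymbol{\Phi}_{p-1} & \boldsymbol{\Phi}_p \\
\mathbf{I}_n & \mathbf{0} & \mathbf{0} & \cdots & \mathbf{0} & \mathbf{0} \\
\mathbf{0} & \mathbf{I}_n & \mathbf{0} & \cdots & \mathbf{0} & \mathbf{0} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
\mathbf{0} & \mathbf{0} & \mathbf{0} & \cdots & \mathbf{I}_n & \mathbf{0}
\end{bmatrix}
\tag{10.1.10}
$$

$$
\underset{(np \times 1)}{\mathbf{v}_t} \equiv
\begin{bmatrix} \boldsymbol{\varepsilon}_t \\ \mathbf{0} \\ \vdots \\ \mathbf{0} \end{bmatrix}.
$$

The VAR($p$) in [10.1.8] can then be rewritten as the following VAR(1):

$$
\boldsymbol{\xi}_t = \mathbf{F}\boldsymbol{\xi}_{t-1} + \mathbf{v}_t,
\tag{10.1.11}
$$

where

$$
E(\mathbf{v}_t\mathbf{v}_\tau') =
\begin{cases} \mathbf{Q} & \text{for } t = \tau \\ \mathbf{0} & \text{otherwise} \end{cases}
$$

and

$$
\underset{(np \times np)}{\mathbf{Q}} \equiv
\begin{bmatrix}
\boldsymbol{\Omega} & \mathbf{0} & \cdots & \mathbf{0} \\
\mathbf{0} & \mathbf{0} & \cdots & \mathbf{0} \\
\vdots & \vdots & \ddots & \vdots \\
\mathbf{0} & \mathbf{0} & \cdots & \mathbf{0}
\end{bmatrix}.
$$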


Conditions for Stationarity 
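Equation [10.1.11] implies that

$$
\boldsymbol{\xi}_{t+s} = \mathbf{v}_{t+s} + \mathbf{F}\mathbf{v}_{t+s-1} + \mathbf{F}^2\mathbf{v}_{t+s-2} + \cdots + \mathbf{F}^{s-1}\mathbf{v}_{t+1} + \mathbf{F}^s\boldsymbol{\xi}_t.
\tag{10.1.12}
$$

In order for the process to be covariance-stationary, the consequences of any given
$\boldsymbol{\varepsilon}_t$ must eventually die out. If the eigenvalues of $\mathbf{F}$ all lie inside the unit circle, then
the VAR turns out to be covariance-stationary.

The following result generalizes Proposition 1.1 from Chapter 1 (for a proof
see Appendix 10.A at the end of this chapter).

Proposition 10.1: The eigenvalues of the matrix $\mathbf{F}$ in [10.1.10] satisfy

$$
\left|\mathbf{I}_n\lambda^p - \boldsymbol{\Phi}_1\lambda^{p-1} - \boldsymbol{\Phi}_2\lambda^{p-2} - \cdots - \boldsymbol{\Phi}_p\right| = 0.
\tag{10.1.13}
$$

Hence, a VAR($p$) is covariance-stationary as long as $|\lambda| < 1$ for all values of
$\lambda$ satisfying [10.1.13]. Equivalently, the VAR is covariance-stationary if all values
of $z$ satisfying

$$
\left|\mathbf{I}_n - \boldsymbol{\Phi}_1 z - \boldsymbol{\Phi}_2 z^2 - \cdots - \boldsymbol{\Phi}_p z^p\right| = 0
$$

lie outside the unit circle.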
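The stationarity condition can be checked numerically. The following is a minimal sketch, assuming NumPy; the helper names `companion` and `is_stationary` and the coefficient values are illustrative choices, not part of the text. It stacks $\boldsymbol{\Phi}_1, \ldots, \boldsymbol{\Phi}_p$ into the matrix $\mathbf{F}$ of [10.1.10] and tests whether every eigenvalue lies inside the unit circle.

```python
import numpy as np

def companion(phis):
    """Stack [Phi_1, ..., Phi_p] into the (np x np) matrix F of [10.1.10]."""
    p = len(phis)
    n = phis[0].shape[0]
    F = np.zeros((n * p, n * p))
    F[:n, :] = np.hstack(phis)            # first block row: Phi_1 ... Phi_p
    F[n:, :-n] = np.eye(n * (p - 1))      # identity blocks below the first row
    return F

def is_stationary(phis):
    """True if all eigenvalues of F lie strictly inside the unit circle."""
    eigvals = np.linalg.eigvals(companion(phis))
    return bool(np.all(np.abs(eigvals) < 1.0))

# Illustrative (made-up) coefficients for a bivariate VAR(2).
Phi1 = np.array([[0.5, 0.1], [0.2, 0.3]])
Phi2 = np.array([[0.1, 0.0], [0.0, 0.1]])
F = companion([Phi1, Phi2])
eigvals = np.linalg.eigvals(F)
stationary = is_stationary([Phi1, Phi2])
```

Each eigenvalue returned here should also make the determinant in [10.1.13] vanish, which gives a direct numerical check of Proposition 10.1.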




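Vector MA(∞) Representation


The first $n$ rows of the vector system represented in [10.1.12] constitute a
vector generalization of equation [4.2.20]:

$$
\begin{aligned}
\mathbf{y}_{t+s} = \boldsymbol{\mu} &+ \boldsymbol{\varepsilon}_{t+s} + \boldsymbol{\Psi}_1\boldsymbol{\varepsilon}_{t+s-1} + \boldsymbol{\Psi}_2\boldsymbol{\varepsilon}_{t+s-2} + \cdots + \boldsymbol{\Psi}_{s-1}\boldsymbol{\varepsilon}_{t+1} \\
&+ \mathbf{F}_{11}^{(s)}(\mathbf{y}_t - \boldsymbol{\mu}) + \mathbf{F}_{12}^{(s)}(\mathbf{y}_{t-1} - \boldsymbol{\mu}) + \cdots + \mathbf{F}_{1p}^{(s)}(\mathbf{y}_{t-p+1} - \boldsymbol{\mu}).
\end{aligned}
\tag{10.1.14}
$$

Here $\boldsymbol{\Psi}_j = \mathbf{F}_{11}^{(j)}$ and $\mathbf{F}_{11}^{(j)}$ denotes the upper left block of $\mathbf{F}^j$, where $\mathbf{F}^j$ is the matrix
$\mathbf{F}$ raised to the $j$th power; that is, the $(n \times n)$ matrix $\mathbf{F}_{11}^{(j)}$ indicates rows 1 through
$n$ and columns 1 through $n$ of the $(np \times np)$ matrix $\mathbf{F}^j$. Similarly, $\mathbf{F}_{12}^{(j)}$ denotes the
block of $\mathbf{F}^j$ consisting of rows 1 through $n$ and columns $(n + 1)$ through $2n$, while
$\mathbf{F}_{1p}^{(j)}$ denotes rows 1 through $n$ and columns $[n(p - 1) + 1]$ through $np$ of $\mathbf{F}^j$.

If the eigenvalues of $\mathbf{F}$ all lie inside the unit circle, then $\mathbf{F}^s \to \mathbf{0}$ as $s \to \infty$ and
$\mathbf{y}_t$ can be expressed as a convergent sum of the history of $\boldsymbol{\varepsilon}$:

$$
\mathbf{y}_t = \boldsymbol{\mu} + \boldsymbol{\varepsilon}_t + \boldsymbol{\Psi}_1\boldsymbol{\varepsilon}_{t-1} + \boldsymbol{\Psi}_2\boldsymbol{\varepsilon}_{t-2} + \boldsymbol{\Psi}_3\boldsymbol{\varepsilon}_{t-3} + \cdots \equiv \boldsymbol{\mu} + \boldsymbol{\Psi}(L)\boldsymbol{\varepsilon}_t,
\tag{10.1.15}
$$

which is a vector MA(∞) representation.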

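Note that $\mathbf{y}_{t-j}$ is a linear function of $\boldsymbol{\varepsilon}_{t-j}, \boldsymbol{\varepsilon}_{t-j-1}, \ldots$, each of which is
uncorrelated with $\boldsymbol{\varepsilon}_{t+1}$ for $j = 0, 1, \ldots$. It follows that $\boldsymbol{\varepsilon}_{t+1}$ is uncorrelated with
$\mathbf{y}_{t-j}$ for any $j \geq 0$. Thus, the linear forecast of $\mathbf{y}_{t+1}$ on the basis of $\mathbf{y}_t, \mathbf{y}_{t-1}, \ldots$ is
given by

$$
\hat{\mathbf{y}}_{t+1|t} = \boldsymbol{\mu} + \boldsymbol{\Phi}_1(\mathbf{y}_t - \boldsymbol{\mu}) + \boldsymbol{\Phi}_2(\mathbf{y}_{t-1} - \boldsymbol{\mu}) + \cdots + \boldsymbol{\Phi}_p(\mathbf{y}_{t-p+1} - \boldsymbol{\mu}),
$$

and $\boldsymbol{\varepsilon}_{t+1}$ can be interpreted as the fundamental innovation for $\mathbf{y}_{t+1}$, that is, the
error in forecasting $\mathbf{y}_{t+1}$ on the basis of a linear function of a constant and $\mathbf{y}_t, \mathbf{y}_{t-1}, \ldots$.
More generally, it follows from [10.1.14] that a forecast of $\mathbf{y}_{t+s}$ on the basis
of $\mathbf{y}_t, \mathbf{y}_{t-1}, \ldots$ will take the form

$$
\hat{\mathbf{y}}_{t+s|t} = \boldsymbol{\mu} + \mathbf{F}_{11}^{(s)}(\mathbf{y}_t - \boldsymbol{\mu}) + \mathbf{F}_{12}^{(s)}(\mathbf{y}_{t-1} - \boldsymbol{\mu}) + \cdots + \mathbf{F}_{1p}^{(s)}(\mathbf{y}_{t-p+1} - \boldsymbol{\mu}).
$$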


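The moving average matrices $\boldsymbol{\Psi}_j$ could equivalently be calculated as follows.
The operators $\boldsymbol{\Phi}(L)$ and $\boldsymbol{\Psi}(L)$ are related by

$$
\boldsymbol{\Psi}(L) = [\boldsymbol{\Phi}(L)]^{-1},
\tag{10.1.16}
$$

requiring

$$
[\mathbf{I}_n - \boldsymbol{\Phi}_1 L - \boldsymbol{\Phi}_2 L^2 - \cdots - \boldsymbol{\Phi}_p L^p][\mathbf{I}_n + \boldsymbol{\Psi}_1 L + \boldsymbol{\Psi}_2 L^2 + \cdots] = \mathbf{I}_n.
$$

Setting the coefficient on $L^1$ equal to the zero matrix, as in Exercise 3.3 of Chapter
3, produces

$$
\boldsymbol{\Psi}_1 - \boldsymbol{\Phi}_1 = \mathbf{0}.
\tag{10.1.17}
$$

Similarly, setting the coefficient on $L^2$ equal to zero gives

$$
\boldsymbol{\Psi}_2 = \boldsymbol{\Phi}_1\boldsymbol{\Psi}_1 + \boldsymbol{\Phi}_2,
\tag{10.1.18}
$$

and in general for $L^s$,

$$
\boldsymbol{\Psi}_s = \boldsymbol{\Phi}_1\boldsymbol{\Psi}_{s-1} + \boldsymbol{\Phi}_2\boldsymbol{\Psi}_{s-2} + \cdots + \boldsymbol{\Phi}_p\boldsymbol{\Psi}_{s-p} \qquad \text{for } s = 1, 2, \ldots,
\tag{10.1.19}
$$

with $\boldsymbol{\Psi}_0 = \mathbf{I}_n$ and $\boldsymbol{\Psi}_s = \mathbf{0}$ for $s < 0$.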
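The recursion [10.1.19] translates directly into a loop. Below is a minimal sketch, assuming NumPy; the function name `ma_coefficients` and the numerical coefficients are illustrative. As a cross-check it also computes the upper left $(n \times n)$ block of $\mathbf{F}^s$, which the text identifies with $\boldsymbol{\Psi}_s$.

```python
import numpy as np

def ma_coefficients(phis, horizon):
    """Return [Psi_0, ..., Psi_horizon] from the recursion [10.1.19]."""
    n = phis[0].shape[0]
    psis = [np.eye(n)]                      # Psi_0 = I_n
    for s in range(1, horizon + 1):
        acc = np.zeros((n, n))
        for j, phi in enumerate(phis, start=1):
            if s - j >= 0:                  # Psi_s = 0 for s < 0
                acc += phi @ psis[s - j]
        psis.append(acc)
    return psis

# Illustrative (made-up) bivariate VAR(2).
Phi1 = np.array([[0.5, 0.1], [0.2, 0.3]])
Phi2 = np.array([[0.1, 0.0], [0.0, 0.1]])
psis = ma_coefficients([Phi1, Phi2], horizon=5)

# Cross-check: Psi_s equals the upper left block of F^s.
n, p = 2, 2
F = np.zeros((n * p, n * p))
F[:n, :n], F[:n, n:] = Phi1, Phi2
F[n:, :n] = np.eye(n)
blocks = [np.linalg.matrix_power(F, s)[:n, :n] for s in range(6)]
```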
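Note that the innovation in the MA(∞) representation [10.1.15] is $\boldsymbol{\varepsilon}_t$, the
fundamental innovation for $\mathbf{y}$. There are alternative moving average representations
based on vector white noise processes other than $\boldsymbol{\varepsilon}_t$. Let $\mathbf{H}$ denote a nonsingular
$(n \times n)$ matrix, and define

$$
\mathbf{u}_t \equiv \mathbf{H}\boldsymbol{\varepsilon}_t.
\tag{10.1.20}
$$

Then certainly $\mathbf{u}_t$ is white noise. Moreover, from [10.1.15] we could write

$$
\begin{aligned}
\mathbf{y}_t &= \boldsymbol{\mu} + \mathbf{H}^{-1}\mathbf{H}\boldsymbol{\varepsilon}_t + \boldsymbol{\Psi}_1\mathbf{H}^{-1}\mathbf{H}\boldsymbol{\varepsilon}_{t-1} + \boldsymbol{\Psi}_2\mathbf{H}^{-1}\mathbf{H}\boldsymbol{\varepsilon}_{t-2} + \boldsymbol{\Psi}_3\mathbf{H}^{-1}\mathbf{H}\boldsymbol{\varepsilon}_{t-3} + \cdots \\
&= \boldsymbol{\mu} + \mathbf{J}_0\mathbf{u}_t + \mathbf{J}_1\mathbf{u}_{t-1} + \mathbf{J}_2\mathbf{u}_{t-2} + \mathbf{J}_3\mathbf{u}_{t-3} + \cdots,
\end{aligned}
\tag{10.1.21}
$$

where

$$
\mathbf{J}_s \equiv \boldsymbol{\Psi}_s\mathbf{H}^{-1}.
$$

For example, $\mathbf{H}$ could be any matrix that diagonalizes $\boldsymbol{\Omega}$, the variance-covariance
matrix of $\boldsymbol{\varepsilon}_t$:

$$
\mathbf{H}\boldsymbol{\Omega}\mathbf{H}' = \mathbf{D},
$$

with $\mathbf{D}$ a diagonal matrix. For such a choice of $\mathbf{H}$, the elements of $\mathbf{u}_t$ are uncorrelated
with one another:

$$
E(\mathbf{u}_t\mathbf{u}_t') = E(\mathbf{H}\boldsymbol{\varepsilon}_t\boldsymbol{\varepsilon}_t'\mathbf{H}') = \mathbf{D}.
$$

Thus, it is always possible to write a stationary VAR($p$) process as a convergent
infinite moving average of a white noise vector $\mathbf{u}_t$ whose elements are mutually
uncorrelated.

There is one important difference between the MA(∞) representations [10.1.15]
and [10.1.21], however. In [10.1.15], the leading MA parameter matrix ($\boldsymbol{\Psi}_0$) is the
identity matrix, whereas in [10.1.21] the leading MA parameter matrix ($\mathbf{J}_0$) is not
the identity matrix. To obtain the MA representation for the fundamental innovations,
we must impose the normalization $\boldsymbol{\Psi}_0 = \mathbf{I}_n$.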
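One concrete choice of $\mathbf{H}$ can be sketched as follows. This is only an illustration under an assumption the text does not make at this point: $\mathbf{H}$ is taken to be the inverse of the Cholesky factor of $\boldsymbol{\Omega}$, so that $\mathbf{D}$ is the identity matrix. The numerical values of `Omega` and `Psi1` are made up.

```python
import numpy as np

# Illustrative (made-up) innovation covariance and first MA matrix.
Omega = np.array([[1.0, 0.5], [0.5, 2.0]])
Psi1 = np.array([[0.5, 0.1], [0.2, 0.3]])

# Assumed choice of H: inverse of the lower triangular Cholesky factor P,
# where Omega = P P'. Then H Omega H' = I, a diagonal matrix.
P = np.linalg.cholesky(Omega)
H = np.linalg.inv(P)
D = H @ Omega @ H.T

# Coefficients of the alternative representation [10.1.21]: J_s = Psi_s H^{-1}.
J0 = np.linalg.inv(H)          # leading matrix; equals P, not the identity
J1 = Psi1 @ np.linalg.inv(H)
```

With this choice the leading matrix $\mathbf{J}_0$ equals the Cholesky factor rather than $\mathbf{I}_n$, which illustrates the difference between [10.1.15] and [10.1.21].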


Assumptions Implicit in a VAR 


For a covariance-stationary process, the parameters $\mathbf{c}$ and $\boldsymbol{\Phi}_1, \ldots, \boldsymbol{\Phi}_p$ in
equation [10.1.4] could be defined as the coefficients of the projection of $\mathbf{y}_t$ on a
constant and $\mathbf{y}_{t-1}, \ldots, \mathbf{y}_{t-p}$. Thus, $\boldsymbol{\varepsilon}_t$ is uncorrelated with $\mathbf{y}_{t-1}, \ldots, \mathbf{y}_{t-p}$ by the
definition of $\boldsymbol{\Phi}_1, \ldots, \boldsymbol{\Phi}_p$. The parameters of a vector autoregression can accordingly
be estimated consistently with $n$ OLS regressions of the form of [10.1.7]. The
additional assumption implicit in a VAR is that the $\boldsymbol{\varepsilon}_t$ defined by this projection is
further uncorrelated with $\mathbf{y}_{t-p-1}, \mathbf{y}_{t-p-2}, \ldots$. The assumption that $\mathbf{y}_t$ follows a
vector autoregression is basically the assumption that $p$ lags are sufficient to summarize
all of the dynamic correlations between elements of $\mathbf{y}$.


10.2. Autocovariances and Convergence Results 
for Vector Processes 
The jth Autocovariance Matrix 


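For a covariance-stationary $n$-dimensional vector process, the $j$th autocovariance
is defined to be the following $(n \times n)$ matrix:

$$
\boldsymbol{\Gamma}_j = E[(\mathbf{y}_t - \boldsymbol{\mu})(\mathbf{y}_{t-j} - \boldsymbol{\mu})'].
\tag{10.2.1}
$$

Note that although $\gamma_j = \gamma_{-j}$ for a scalar process, the same is not true of a vector
process:

$$
\boldsymbol{\Gamma}_j \neq \boldsymbol{\Gamma}_{-j}.
$$

For example, the (1, 2) element of $\boldsymbol{\Gamma}_j$ gives the covariance between $y_{1t}$ and $y_{2,t-j}$.
The (1, 2) element of $\boldsymbol{\Gamma}_{-j}$ gives the covariance between $y_{1t}$ and $y_{2,t+j}$. There is no
reason that these should be related: the response of $y_1$ to previous movements in
$y_2$ could be completely different from the response of $y_2$ to previous movements
in $y_1$.

Instead, the correct relation is

$$
\boldsymbol{\Gamma}_j' = \boldsymbol{\Gamma}_{-j}.
\tag{10.2.2}
$$

To derive [10.2.2], notice that covariance-stationarity would mean that $t$ in [10.2.1]
could be replaced with $t + j$:

$$
\boldsymbol{\Gamma}_j = E[(\mathbf{y}_{t+j} - \boldsymbol{\mu})(\mathbf{y}_{(t+j)-j} - \boldsymbol{\mu})'] = E[(\mathbf{y}_{t+j} - \boldsymbol{\mu})(\mathbf{y}_t - \boldsymbol{\mu})'].
$$

Taking transposes,

$$
\boldsymbol{\Gamma}_j' = E[(\mathbf{y}_t - \boldsymbol{\mu})(\mathbf{y}_{t+j} - \boldsymbol{\mu})'] = \boldsymbol{\Gamma}_{-j},
$$

as claimed.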


Vector MA(q) Process 
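A vector moving average process of order $q$ takes the form

$$
\mathbf{y}_t = \boldsymbol{\mu} + \boldsymbol{\varepsilon}_t + \boldsymbol{\Theta}_1\boldsymbol{\varepsilon}_{t-1} + \boldsymbol{\Theta}_2\boldsymbol{\varepsilon}_{t-2} + \cdots + \boldsymbol{\Theta}_q\boldsymbol{\varepsilon}_{t-q},
\tag{10.2.3}
$$

where $\boldsymbol{\varepsilon}_t$ is a vector white noise process satisfying [10.1.5] and [10.1.6] and $\boldsymbol{\Theta}_j$
denotes an $(n \times n)$ matrix of MA coefficients for $j = 1, 2, \ldots, q$. The mean of
$\mathbf{y}_t$ is $\boldsymbol{\mu}$, and the variance is

$$
\begin{aligned}
\boldsymbol{\Gamma}_0 &= E[(\mathbf{y}_t - \boldsymbol{\mu})(\mathbf{y}_t - \boldsymbol{\mu})'] \\
&= E[\boldsymbol{\varepsilon}_t\boldsymbol{\varepsilon}_t'] + \boldsymbol{\Theta}_1 E[\boldsymbol{\varepsilon}_{t-1}\boldsymbol{\varepsilon}_{t-1}']\boldsymbol{\Theta}_1' + \boldsymbol{\Theta}_2 E[\boldsymbol{\varepsilon}_{t-2}\boldsymbol{\varepsilon}_{t-2}']\boldsymbol{\Theta}_2' + \cdots + \boldsymbol{\Theta}_q E[\boldsymbol{\varepsilon}_{t-q}\boldsymbol{\varepsilon}_{t-q}']\boldsymbol{\Theta}_q' \\
&= \boldsymbol{\Omega} + \boldsymbol{\Theta}_1\boldsymbol{\Omega}\boldsymbol{\Theta}_1' + \boldsymbol{\Theta}_2\boldsymbol{\Omega}\boldsymbol{\Theta}_2' + \cdots + \boldsymbol{\Theta}_q\boldsymbol{\Omega}\boldsymbol{\Theta}_q',
\end{aligned}
\tag{10.2.4}
$$

with autocovariances

$$
\boldsymbol{\Gamma}_j =
\begin{cases}
\boldsymbol{\Theta}_j\boldsymbol{\Omega} + \boldsymbol{\Theta}_{j+1}\boldsymbol{\Omega}\boldsymbol{\Theta}_1' + \boldsymbol{\Theta}_{j+2}\boldsymbol{\Omega}\boldsymbol{\Theta}_2' + \cdots + \boldsymbol{\Theta}_q\boldsymbol{\Omega}\boldsymbol{\Theta}_{q-j}' & \text{for } j = 1, 2, \ldots, q \\
\boldsymbol{\Omega}\boldsymbol{\Theta}_{-j}' + \boldsymbol{\Theta}_1\boldsymbol{\Omega}\boldsymbol{\Theta}_{-j+1}' + \boldsymbol{\Theta}_2\boldsymbol{\Omega}\boldsymbol{\Theta}_{-j+2}' + \cdots + \boldsymbol{\Theta}_{q+j}\boldsymbol{\Omega}\boldsymbol{\Theta}_q' & \text{for } j = -1, -2, \ldots, -q \\
\mathbf{0} & \text{for } |j| > q,
\end{cases}
\tag{10.2.5}
$$

where $\boldsymbol{\Theta}_0 = \mathbf{I}_n$. Thus, any vector MA($q$) process is covariance-stationary.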
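Formulas [10.2.4] and [10.2.5] can be evaluated with a short routine. The sketch below assumes NumPy; the function name `ma_autocov` and the matrices `Theta1`, `Theta2`, and `Omega` are illustrative. It also makes the relation [10.2.2] visible: the matrix for lag $-j$ is the transpose of the matrix for lag $j$, and the two are generally not equal.

```python
import numpy as np

def ma_autocov(thetas, Omega, j):
    """Gamma_j of a vector MA(q) from [10.2.4]-[10.2.5].

    thetas = [Theta_1, ..., Theta_q]; Theta_0 = I_n is added internally.
    """
    n = Omega.shape[0]
    q = len(thetas)
    th = [np.eye(n)] + list(thetas)
    if abs(j) > q:
        return np.zeros((n, n))
    if j >= 0:
        # Theta_j Omega Theta_0' + ... + Theta_q Omega Theta_{q-j}'
        return sum(th[j + k] @ Omega @ th[k].T for k in range(q - j + 1))
    # Theta_0 Omega Theta_{-j}' + ... + Theta_{q+j} Omega Theta_q'
    return sum(th[k] @ Omega @ th[k - j].T for k in range(q + j + 1))

# Illustrative (made-up) bivariate MA(2).
Theta1 = np.array([[0.5, 0.2], [0.0, 0.3]])
Theta2 = np.array([[0.1, 0.0], [0.4, 0.1]])
Omega = np.array([[1.0, 0.3], [0.3, 2.0]])

Gamma0 = ma_autocov([Theta1, Theta2], Omega, 0)
Gamma1 = ma_autocov([Theta1, Theta2], Omega, 1)
Gamma_m1 = ma_autocov([Theta1, Theta2], Omega, -1)
Gamma3 = ma_autocov([Theta1, Theta2], Omega, 3)
```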


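Vector MA(∞) Process

The vector MA(∞) process is written

$$
\mathbf{y}_t = \boldsymbol{\mu} + \boldsymbol{\varepsilon}_t + \boldsymbol{\Psi}_1\boldsymbol{\varepsilon}_{t-1} + \boldsymbol{\Psi}_2\boldsymbol{\varepsilon}_{t-2} + \cdots
\tag{10.2.6}
$$

for $\boldsymbol{\varepsilon}_t$ again satisfying [10.1.5] and [10.1.6].

A sequence of scalars $\{h_s\}_{s=-\infty}^{\infty}$ was said to be absolutely summable if
$\sum_{s=-\infty}^{\infty}|h_s| < \infty$. For $\mathbf{H}_s$ an $(n \times m)$ matrix, the sequence of matrices $\{\mathbf{H}_s\}_{s=-\infty}^{\infty}$ is
absolutely summable if each of its elements forms an absolutely summable scalar
sequence. For example, if $\psi_{ij}^{(s)}$ denotes the row $i$, column $j$ element of the moving
average parameter matrix $\boldsymbol{\Psi}_s$ associated with lag $s$, then the sequence $\{\boldsymbol{\Psi}_s\}_{s=0}^{\infty}$ is
absolutely summable if

$$
\sum_{s=0}^{\infty}\left|\psi_{ij}^{(s)}\right| < \infty \qquad \text{for } i = 1, 2, \ldots, n \text{ and } j = 1, 2, \ldots, n.
\tag{10.2.7}
$$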


Many of the results for scalar MA(∞) processes with absolutely summable
coefficients go through for vector processes as well. This is summarized by the
following theorem, proved in Appendix 10.A to this chapter.

Proposition 10.2: Let $\mathbf{y}_t$ be an $(n \times 1)$ vector satisfying

$$
\mathbf{y}_t = \boldsymbol{\mu} + \sum_{k=0}^{\infty}\boldsymbol{\Psi}_k\boldsymbol{\varepsilon}_{t-k},
$$

where $\boldsymbol{\varepsilon}_t$ is vector white noise satisfying [10.1.5] and [10.1.6] and $\{\boldsymbol{\Psi}_k\}_{k=0}^{\infty}$ is absolutely
summable. Let $y_{it}$ denote the $i$th element of $\mathbf{y}_t$, and let $\mu_i$ denote the $i$th element of
$\boldsymbol{\mu}$. Then

(a) the autocovariance between the $i$th variable at time $t$ and the $j$th variable $s$
periods earlier, $E(y_{it} - \mu_i)(y_{j,t-s} - \mu_j)$, exists and is given by the row $i$,
column $j$ element of

$$
\boldsymbol{\Gamma}_s = \sum_{v=0}^{\infty}\boldsymbol{\Psi}_{s+v}\boldsymbol{\Omega}\boldsymbol{\Psi}_v' \qquad \text{for } s = 0, 1, 2, \ldots;
$$

(b) the sequence of matrices $\{\boldsymbol{\Gamma}_s\}_{s=0}^{\infty}$ is absolutely summable.

If, furthermore, $\{\boldsymbol{\varepsilon}_t\}_{t=-\infty}^{\infty}$ is an i.i.d. sequence with $E|\varepsilon_{i_1,t}\,\varepsilon_{i_2,t}\,\varepsilon_{i_3,t}\,\varepsilon_{i_4,t}| < \infty$ for
$i_1, i_2, i_3, i_4 = 1, 2, \ldots, n$, then also,

(c) $E|y_{i_1,t_1}\,y_{i_2,t_2}\,y_{i_3,t_3}\,y_{i_4,t_4}| < \infty$ for $i_1, i_2, i_3, i_4 = 1, 2, \ldots, n$ and for all $t_1, t_2, t_3, t_4$;

(d) $(1/T)\sum_{t=1}^{T} y_{it}\,y_{j,t-s} \xrightarrow{p} E(y_{it}\,y_{j,t-s})$ for $i, j = 1, 2, \ldots, n$ and for all $s$.

Result (a) implies that the second moments of an MA(∞) vector process with
absolutely summable coefficients can be found by taking the limit of [10.2.5] as
$q \to \infty$. Result (b) is a convergence condition on these moments that will turn out
to ensure that the vector process is ergodic for the mean (see Proposition 10.5 later
in this chapter). Result (c) says that $\mathbf{y}_t$ has bounded fourth moments, while result
(d) establishes that $\mathbf{y}_t$ is ergodic for second moments.

Note that the vector MA(∞) representation of a stationary vector autoregression
calculated from [10.1.4] satisfies the absolute summability condition. To see
this, recall from [10.1.14] that $\boldsymbol{\Psi}_s$ is a block of the matrix $\mathbf{F}^s$. If $\mathbf{F}$ has $np$ distinct
eigenvalues $(\lambda_1, \lambda_2, \ldots, \lambda_{np})$, then any element of $\boldsymbol{\Psi}_s$ can be written as a weighted
average of these eigenvalues as in equation [1.2.20]:

$$
\psi_{ij}^{(s)} = c_1(i, j)\cdot\lambda_1^s + c_2(i, j)\cdot\lambda_2^s + \cdots + c_{np}(i, j)\cdot\lambda_{np}^s,
$$

where $c_v(i, j)$ denotes a constant that depends on $v$, $i$, and $j$ but not $s$. Absolute
summability [10.2.7] then follows from the same arguments as in Exercise 3.5.




Multivariate Filters 
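Suppose that the $(n \times 1)$ vector $\mathbf{y}_t$ follows an MA(∞) process:

$$
\mathbf{y}_t = \boldsymbol{\mu}_Y + \boldsymbol{\Psi}(L)\boldsymbol{\varepsilon}_t,
\tag{10.2.8}
$$

with $\{\boldsymbol{\Psi}_k\}_{k=0}^{\infty}$ absolutely summable. Let $\{\mathbf{H}_k\}_{k=-\infty}^{\infty}$ be an absolutely summable
sequence of $(r \times n)$ matrices and suppose that an $(r \times 1)$ vector $\mathbf{x}_t$ is related to $\mathbf{y}_t$
according to

$$
\mathbf{x}_t = \mathbf{H}(L)\mathbf{y}_t = \sum_{k=-\infty}^{\infty}\mathbf{H}_k\mathbf{y}_{t-k}.
\tag{10.2.9}
$$

That is,

$$
\begin{aligned}
\mathbf{x}_t &= \mathbf{H}(L)[\boldsymbol{\mu}_Y + \boldsymbol{\Psi}(L)\boldsymbol{\varepsilon}_t] \\
&= \mathbf{H}(1)\boldsymbol{\mu}_Y + \mathbf{H}(L)\boldsymbol{\Psi}(L)\boldsymbol{\varepsilon}_t \\
&\equiv \boldsymbol{\mu}_X + \mathbf{B}(L)\boldsymbol{\varepsilon}_t,
\end{aligned}
\tag{10.2.10}
$$

where $\boldsymbol{\mu}_X \equiv \mathbf{H}(1)\boldsymbol{\mu}_Y$ and $\mathbf{B}(L)$ is the compound operator given by

$$
\mathbf{B}(L) = \sum_{k=-\infty}^{\infty}\mathbf{B}_k L^k = \mathbf{H}(L)\boldsymbol{\Psi}(L).
\tag{10.2.11}
$$

The following proposition establishes that $\mathbf{x}_t$ follows an absolutely summable
two-sided MA(∞) process.

Proposition 10.3: Let $\{\boldsymbol{\Psi}_k\}_{k=0}^{\infty}$ be an absolutely summable sequence of $(n \times n)$
matrices and let $\{\mathbf{H}_k\}_{k=-\infty}^{\infty}$ be an absolutely summable sequence of $(r \times n)$ matrices.
Then the sequence of matrices $\{\mathbf{B}_k\}_{k=-\infty}^{\infty}$ associated with the operator $\mathbf{B}(L) = \mathbf{H}(L)\boldsymbol{\Psi}(L)$
is absolutely summable.

If $\{\boldsymbol{\varepsilon}_t\}$ in [10.2.8] is i.i.d. with finite fourth moments, then $\{\mathbf{x}_t\}$ in [10.2.9] has
finite fourth moments and is ergodic for second moments.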


Vector Autoregression 


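Next we derive expressions for the second moments for $\mathbf{y}_t$ following a VAR($p$).
Let $\boldsymbol{\xi}_t$ be as defined in equation [10.1.9]. Assuming that $\boldsymbol{\xi}$ and $\mathbf{y}$ are covariance-stationary,
let $\boldsymbol{\Sigma}$ denote the variance of $\boldsymbol{\xi}$,

$$
\begin{aligned}
\boldsymbol{\Sigma} &= E(\boldsymbol{\xi}_t\boldsymbol{\xi}_t') \\
&= E\left\{
\begin{bmatrix} \mathbf{y}_t - \boldsymbol{\mu} \\ \mathbf{y}_{t-1} - \boldsymbol{\mu} \\ \vdots \\ \mathbf{y}_{t-p+1} - \boldsymbol{\mu} \end{bmatrix}
\begin{bmatrix} (\mathbf{y}_t - \boldsymbol{\mu})' & (\mathbf{y}_{t-1} - \boldsymbol{\mu})' & \cdots & (\mathbf{y}_{t-p+1} - \boldsymbol{\mu})' \end{bmatrix}
\right\} \\
&= \begin{bmatrix}
\boldsymbol{\Gamma}_0 & \boldsymbol{\Gamma}_1 & \cdots & \boldsymbol{\Gamma}_{p-1} \\
\boldsymbol{\Gamma}_1' & \boldsymbol{\Gamma}_0 & \cdots & \boldsymbol{\Gamma}_{p-2} \\
\vdots & \vdots & \ddots & \vdots \\
\boldsymbol{\Gamma}_{p-1}' & \boldsymbol{\Gamma}_{p-2}' & \cdots & \boldsymbol{\Gamma}_0
\end{bmatrix},
\end{aligned}
\tag{10.2.12}
$$

where $\boldsymbol{\Gamma}_j$ denotes the $j$th autocovariance of the original process $\mathbf{y}$. Postmultiplying
[10.1.11] by its own transpose and taking expectations gives

$$
E[\boldsymbol{\xi}_t\boldsymbol{\xi}_t'] = E[(\mathbf{F}\boldsymbol{\xi}_{t-1} + \mathbf{v}_t)(\mathbf{F}\boldsymbol{\xi}_{t-1} + \mathbf{v}_t)'] = \mathbf{F}\,E(\boldsymbol{\xi}_{t-1}\boldsymbol{\xi}_{t-1}')\,\mathbf{F}' + E(\mathbf{v}_t\mathbf{v}_t'),
$$

or

$$
\boldsymbol{\Sigma} = \mathbf{F}\boldsymbol{\Sigma}\mathbf{F}' + \mathbf{Q}.
\tag{10.2.13}
$$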


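A closed-form solution to [10.2.13] can be obtained in terms of the vec operator.
If $\mathbf{A}$ is an $(m \times n)$ matrix, then vec($\mathbf{A}$) is an $(mn \times 1)$ column vector,
obtained by stacking the columns of $\mathbf{A}$, one below the other, with the columns
ordered from left to right. For example, if

$$
\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \end{bmatrix},
$$

then

$$
\operatorname{vec}(\mathbf{A}) = \begin{bmatrix} a_{11} \\ a_{21} \\ a_{31} \\ a_{12} \\ a_{22} \\ a_{32} \end{bmatrix}.
\tag{10.2.14}
$$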


Appendix 10.A establishes the following useful result. 


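Proposition 10.4: Let $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ be matrices whose dimensions are such that the
product $\mathbf{ABC}$ exists. Then

$$
\operatorname{vec}(\mathbf{ABC}) = (\mathbf{C}' \otimes \mathbf{A})\cdot\operatorname{vec}(\mathbf{B}),
\tag{10.2.15}
$$

where the symbol $\otimes$ denotes the Kronecker product.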
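The identity in Proposition 10.4 is easy to confirm numerically. The sketch below assumes NumPy; the helper `vec` is an illustrative definition that stacks columns by reshaping in column-major (Fortran) order, and the matrices are random draws chosen only so that the product exists.

```python
import numpy as np

def vec(M):
    """Stack the columns of M, left to right, into one long vector."""
    return M.reshape(-1, order="F")

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((4, 5))

lhs = vec(A @ B @ C)                 # vec(ABC), length 2 * 5
rhs = np.kron(C.T, A) @ vec(B)       # (C' kron A) vec(B)

# Column stacking of a small (3 x 2) matrix.
stacked = vec(np.array([[1, 2], [3, 4], [5, 6]]))
```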


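Thus, if the vec operator is applied to both sides of [10.2.13], the result is

$$
\operatorname{vec}(\boldsymbol{\Sigma}) = (\mathbf{F} \otimes \mathbf{F})\cdot\operatorname{vec}(\boldsymbol{\Sigma}) + \operatorname{vec}(\mathbf{Q}) = \mathcal{A}\operatorname{vec}(\boldsymbol{\Sigma}) + \operatorname{vec}(\mathbf{Q}),
\tag{10.2.16}
$$

where

$$
\mathcal{A} \equiv (\mathbf{F} \otimes \mathbf{F}).
\tag{10.2.17}
$$

Let $r = np$, so that $\mathbf{F}$ is an $(r \times r)$ matrix and $\mathcal{A}$ is an $(r^2 \times r^2)$ matrix.
Equation [10.2.16] has the solution

$$
\operatorname{vec}(\boldsymbol{\Sigma}) = [\mathbf{I}_{r^2} - \mathcal{A}]^{-1}\operatorname{vec}(\mathbf{Q}),
\tag{10.2.18}
$$

provided that the matrix $[\mathbf{I}_{r^2} - \mathcal{A}]$ is nonsingular. This will be true as long as unity
is not an eigenvalue of $\mathcal{A}$. But recall that the eigenvalues of $\mathbf{F} \otimes \mathbf{F}$ are all of the
form $\lambda_i\lambda_j$, where $\lambda_i$ and $\lambda_j$ are eigenvalues of $\mathbf{F}$. Since $|\lambda_i| < 1$ for all $i$, it follows
that all eigenvalues of $\mathcal{A}$ are inside the unit circle, meaning that $[\mathbf{I}_{r^2} - \mathcal{A}]$ is indeed
nonsingular.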
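The closed-form solution [10.2.18] can be sketched in a few lines. The code assumes NumPy; the function name `var_state_covariance` and the VAR(2) coefficients and `Omega` are illustrative. It solves the linear system rather than forming the inverse explicitly, which is a numerical convenience and not something the text prescribes.

```python
import numpy as np

def var_state_covariance(F, Q):
    """Solve Sigma = F Sigma F' + Q via vec(Sigma) = (I - F kron F)^{-1} vec(Q)."""
    r = F.shape[0]
    A = np.kron(F, F)
    vec_Q = Q.reshape(-1, order="F")               # column stacking
    vec_Sigma = np.linalg.solve(np.eye(r * r) - A, vec_Q)
    return vec_Sigma.reshape((r, r), order="F")

# Illustrative (made-up) stationary bivariate VAR(2).
n, p = 2, 2
Phi1 = np.array([[0.5, 0.1], [0.2, 0.3]])
Phi2 = np.array([[0.1, 0.0], [0.0, 0.1]])
Omega = np.array([[1.0, 0.5], [0.5, 2.0]])

F = np.zeros((n * p, n * p))
F[:n, :n], F[:n, n:] = Phi1, Phi2
F[n:, :n] = np.eye(n)
Q = np.zeros((n * p, n * p))
Q[:n, :n] = Omega

Sigma = var_state_covariance(F, Q)
Gamma0 = Sigma[:n, :n]      # variance of y_t
Gamma1 = Sigma[:n, n:]      # first autocovariance of y_t
```

The blocks of `Sigma` follow the layout of [10.2.12], so `Gamma0` and `Gamma1` can be read off directly.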

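The first $p$ autocovariance matrices of a VAR($p$) process can be calculated
by substituting [10.2.12] into [10.2.18]:

$$
\operatorname{vec}
\begin{bmatrix}
\boldsymbol{\Gamma}_0 & \boldsymbol{\Gamma}_1 & \cdots & \boldsymbol{\Gamma}_{p-1} \\
\boldsymbol{\Gamma}_1' & \boldsymbol{\Gamma}_0 & \cdots & \boldsymbol{\Gamma}_{p-2} \\
\vdots & \vdots & \ddots & \vdots \\
\boldsymbol{\Gamma}_{p-1}' & \boldsymbol{\Gamma}_{p-2}' & \cdots & \boldsymbol{\Gamma}_0
\end{bmatrix}
= [\mathbf{I}_{r^2} - \mathcal{A}]^{-1}\operatorname{vec}(\mathbf{Q}).
\tag{10.2.19}
$$

The $j$th autocovariance of $\boldsymbol{\xi}$ (denoted $\boldsymbol{\Sigma}_j$) can be found by postmultiplying
[10.1.11] by $\boldsymbol{\xi}_{t-j}'$ and taking expectations:

$$
E(\boldsymbol{\xi}_t\boldsymbol{\xi}_{t-j}') = \mathbf{F}\cdot E(\boldsymbol{\xi}_{t-1}\boldsymbol{\xi}_{t-j}') + E(\mathbf{v}_t\boldsymbol{\xi}_{t-j}').
$$

Thus,

$$
\boldsymbol{\Sigma}_j = \mathbf{F}\boldsymbol{\Sigma}_{j-1} \qquad \text{for } j = 1, 2, \ldots,
\tag{10.2.20}
$$

or

$$
\boldsymbol{\Sigma}_j = \mathbf{F}^j\boldsymbol{\Sigma} \qquad \text{for } j = 1, 2, \ldots.
\tag{10.2.21}
$$

The $j$th autocovariance $\boldsymbol{\Gamma}_j$ of the original process $\mathbf{y}_t$ is given by the first $n$ rows and
$n$ columns of [10.2.20]:

$$
\boldsymbol{\Gamma}_j = \boldsymbol{\Phi}_1\boldsymbol{\Gamma}_{j-1} + \boldsymbol{\Phi}_2\boldsymbol{\Gamma}_{j-2} + \cdots + \boldsymbol{\Phi}_p\boldsymbol{\Gamma}_{j-p} \qquad \text{for } j = p, p + 1, p + 2, \ldots.
\tag{10.2.22}
$$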


10.3. The Autocovariance-Generating Function 
for Vector Processes 


Definition of Autocovariance-Generating Function
for Vector Processes


Recall that for a covariance-stationary univariate process $y_t$ with absolutely
summable autocovariances, the (scalar-valued) autocovariance-generating function
$g_Y(z)$ is defined as

$$
g_Y(z) \equiv \sum_{j=-\infty}^{\infty}\gamma_j z^j
$$

with

$$
\gamma_j = E[(y_t - \mu)(y_{t-j} - \mu)]
$$

and $z$ a complex scalar. For a covariance-stationary vector process $\mathbf{y}_t$ with an
absolutely summable sequence of autocovariance matrices, the analogous matrix-valued
autocovariance-generating function $\mathbf{G}_Y(z)$ is defined as

$$
\mathbf{G}_Y(z) \equiv \sum_{j=-\infty}^{\infty}\boldsymbol{\Gamma}_j z^j,
\tag{10.3.1}
$$

where

$$
\boldsymbol{\Gamma}_j = E[(\mathbf{y}_t - \boldsymbol{\mu})(\mathbf{y}_{t-j} - \boldsymbol{\mu})']
$$

and $z$ is again a complex scalar.




Autocovariance-Generating Function for a Vector Moving 
Average Process 


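For example, for the vector white noise process $\boldsymbol{\varepsilon}_t$ characterized by [10.1.5]
and [10.1.6], the autocovariance-generating function is

$$
\mathbf{G}_\varepsilon(z) = \boldsymbol{\Omega}.
\tag{10.3.2}
$$

For the vector MA($q$) process of [10.2.3], the univariate expression [3.6.3] for the
autocovariance-generating function generalizes to

$$
\begin{aligned}
\mathbf{G}_Y(z) = {} & (\mathbf{I}_n + \boldsymbol{\Theta}_1 z + \boldsymbol{\Theta}_2 z^2 + \cdots + \boldsymbol{\Theta}_q z^q)\,\boldsymbol{\Omega} \\
& \times (\mathbf{I}_n + \boldsymbol{\Theta}_1' z^{-1} + \boldsymbol{\Theta}_2' z^{-2} + \cdots + \boldsymbol{\Theta}_q' z^{-q}).
\end{aligned}
\tag{10.3.3}
$$

This can be verified by noting that the coefficient on $z^j$ in [10.3.3] is equal to $\boldsymbol{\Gamma}_j$
as given in [10.2.5].

For an MA(∞) process of the form

$$
\mathbf{y}_t = \boldsymbol{\mu} + \boldsymbol{\Psi}_0\boldsymbol{\varepsilon}_t + \boldsymbol{\Psi}_1\boldsymbol{\varepsilon}_{t-1} + \boldsymbol{\Psi}_2\boldsymbol{\varepsilon}_{t-2} + \cdots = \boldsymbol{\mu} + \boldsymbol{\Psi}(L)\boldsymbol{\varepsilon}_t,
$$

with $\{\boldsymbol{\Psi}_k\}_{k=0}^{\infty}$ absolutely summable, [10.3.3] generalizes to

$$
\mathbf{G}_Y(z) = [\boldsymbol{\Psi}(z)]\,\boldsymbol{\Omega}\,[\boldsymbol{\Psi}(z^{-1})]'.
\tag{10.3.4}
$$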


Autocovariance-Generating Function 
for a Vector Autoregression 


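Consider the VAR(1) process $\boldsymbol{\xi}_t = \mathbf{F}\boldsymbol{\xi}_{t-1} + \mathbf{v}_t$ with eigenvalues of $\mathbf{F}$ inside
the unit circle and with $\boldsymbol{\xi}_t$ an $(r \times 1)$ vector and $E(\mathbf{v}_t\mathbf{v}_t') = \mathbf{Q}$. Equation [10.3.4]
implies that the autocovariance-generating function can be expressed as

$$
\begin{aligned}
\mathbf{G}_\xi(z) &= [\mathbf{I}_r - \mathbf{F}z]^{-1}\,\mathbf{Q}\,[\mathbf{I}_r - \mathbf{F}'z^{-1}]^{-1} \\
&= [\mathbf{I}_r + \mathbf{F}z + \mathbf{F}^2z^2 + \mathbf{F}^3z^3 + \cdots]\,\mathbf{Q} \\
&\qquad \times [\mathbf{I}_r + (\mathbf{F}')z^{-1} + (\mathbf{F}')^2z^{-2} + (\mathbf{F}')^3z^{-3} + \cdots].
\end{aligned}
\tag{10.3.5}
$$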


Transformations of Vector Processes 


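The autocovariance-generating function of the sum of two univariate processes
that are uncorrelated with each other is equal to the sum of their individual
autocovariance-generating functions (equation [4.7.19]). This result readily generalizes
to the vector case:

$$
\begin{aligned}
\mathbf{G}_{X+W}(z) &= \sum_{j=-\infty}^{\infty} E[(\mathbf{x}_t + \mathbf{w}_t - \boldsymbol{\mu}_X - \boldsymbol{\mu}_W)(\mathbf{x}_{t-j} + \mathbf{w}_{t-j} - \boldsymbol{\mu}_X - \boldsymbol{\mu}_W)']\,z^j \\
&= \sum_{j=-\infty}^{\infty} E[(\mathbf{x}_t - \boldsymbol{\mu}_X)(\mathbf{x}_{t-j} - \boldsymbol{\mu}_X)']\,z^j + \sum_{j=-\infty}^{\infty} E[(\mathbf{w}_t - \boldsymbol{\mu}_W)(\mathbf{w}_{t-j} - \boldsymbol{\mu}_W)']\,z^j \\
&= \mathbf{G}_X(z) + \mathbf{G}_W(z).
\end{aligned}
$$

Note also that if an $(r \times 1)$ vector $\boldsymbol{\xi}_t$ is premultiplied by a nonstochastic
$(n \times r)$ matrix $\mathbf{H}'$, the effect is to premultiply the autocovariance by $\mathbf{H}'$ and
postmultiply by $\mathbf{H}$:

$$
E[(\mathbf{H}'\boldsymbol{\xi}_t - \mathbf{H}'\boldsymbol{\mu}_\xi)(\mathbf{H}'\boldsymbol{\xi}_{t-j} - \mathbf{H}'\boldsymbol{\mu}_\xi)'] = \mathbf{H}'E[(\boldsymbol{\xi}_t - \boldsymbol{\mu}_\xi)(\boldsymbol{\xi}_{t-j} - \boldsymbol{\mu}_\xi)']\mathbf{H},
$$

implying

$$
\mathbf{G}_{H'\xi}(z) = \mathbf{H}'\mathbf{G}_\xi(z)\mathbf{H}.
$$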


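Putting these results together, consider $\boldsymbol{\xi}_t$ the $r$-dimensional VAR(1) process
$\boldsymbol{\xi}_t = \mathbf{F}\boldsymbol{\xi}_{t-1} + \mathbf{v}_t$ and a new process $\mathbf{u}_t$ given by $\mathbf{u}_t = \mathbf{H}'\boldsymbol{\xi}_t + \mathbf{w}_t$ with $\mathbf{w}_t$ a white
noise process that is uncorrelated with $\boldsymbol{\xi}_{t-j}$ for all $j$. Then

$$
\mathbf{G}_U(z) = \mathbf{H}'\mathbf{G}_\xi(z)\mathbf{H} + \mathbf{G}_W(z),
$$

or, if $\mathbf{R}$ is the variance of $\mathbf{w}_t$,

$$
\mathbf{G}_U(z) = \mathbf{H}'[\mathbf{I}_r - \mathbf{F}z]^{-1}\,\mathbf{Q}\,[\mathbf{I}_r - \mathbf{F}'z^{-1}]^{-1}\mathbf{H} + \mathbf{R}.
\tag{10.3.6}
$$

More generally, consider an $(n \times 1)$ vector $\mathbf{y}_t$ characterized by

$$
\mathbf{y}_t = \boldsymbol{\mu}_Y + \boldsymbol{\Psi}(L)\boldsymbol{\varepsilon}_t,
$$

where $\boldsymbol{\varepsilon}_t$ is a white noise process with variance-covariance matrix given by $\boldsymbol{\Omega}$ and
where $\boldsymbol{\Psi}(L) = \sum_{k=0}^{\infty}\boldsymbol{\Psi}_k L^k$ with $\{\boldsymbol{\Psi}_k\}_{k=0}^{\infty}$ absolutely summable. Thus, the
autocovariance-generating function for $\mathbf{y}$ is

$$
\mathbf{G}_Y(z) = \boldsymbol{\Psi}(z)\,\boldsymbol{\Omega}\,[\boldsymbol{\Psi}(z^{-1})]'.
\tag{10.3.7}
$$

Let $\{\mathbf{H}_k\}_{k=-\infty}^{\infty}$ be an absolutely summable sequence of $(r \times n)$ matrices, and suppose
that an $(r \times 1)$ vector $\mathbf{x}_t$ is constructed from $\mathbf{y}_t$ according to

$$
\mathbf{x}_t = \mathbf{H}(L)\mathbf{y}_t = \sum_{k=-\infty}^{\infty}\mathbf{H}_k\mathbf{y}_{t-k} = \boldsymbol{\mu}_X + \mathbf{B}(L)\boldsymbol{\varepsilon}_t,
$$

where $\boldsymbol{\mu}_X = \mathbf{H}(1)\boldsymbol{\mu}_Y$ and $\mathbf{B}(L) = \mathbf{H}(L)\boldsymbol{\Psi}(L)$ as in [10.2.10] and [10.2.11]. Then
the autocovariance-generating function for $\mathbf{x}$ can be found from

$$
\mathbf{G}_X(z) = \mathbf{B}(z)\,\boldsymbol{\Omega}\,[\mathbf{B}(z^{-1})]' = [\mathbf{H}(z)\boldsymbol{\Psi}(z)]\,\boldsymbol{\Omega}\,[\boldsymbol{\Psi}(z^{-1})]'[\mathbf{H}(z^{-1})]'.
\tag{10.3.8}
$$

Comparing [10.3.8] with [10.3.7], the effect of applying the filter $\mathbf{H}(L)$ to $\mathbf{y}_t$ is to
premultiply the autocovariance-generating function by $\mathbf{H}(z)$ and to postmultiply
by the transpose of $\mathbf{H}(z^{-1})$:

$$
\mathbf{G}_X(z) = [\mathbf{H}(z)]\,\mathbf{G}_Y(z)\,[\mathbf{H}(z^{-1})]'.
\tag{10.3.9}
$$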


10.4. The Spectrum for Vector Processes 

Let $\mathbf{y}_t$ be an $(n \times 1)$ vector with mean $E(\mathbf{y}_t) = \boldsymbol{\mu}$ and $k$th autocovariance matrix

$$
E[(\mathbf{y}_t - \boldsymbol{\mu})(\mathbf{y}_{t-k} - \boldsymbol{\mu})'] = \boldsymbol{\Gamma}_k.
\tag{10.4.1}
$$

If $\{\boldsymbol{\Gamma}_k\}_{k=-\infty}^{\infty}$ is absolutely summable and if $z$ is a complex scalar, the autocovariance-generating
function of $\mathbf{y}$ is given by

$$
\mathbf{G}_Y(z) = \sum_{k=-\infty}^{\infty}\boldsymbol{\Gamma}_k z^k.
\tag{10.4.2}
$$

The function $\mathbf{G}_Y(z)$ associates an $(n \times n)$ matrix of complex numbers with the
complex scalar $z$. If [10.4.2] is divided by $2\pi$ and evaluated at $z = e^{-i\omega}$, where
$\omega$ is a real scalar and $i = \sqrt{-1}$, the result is the population spectrum of the
vector $\mathbf{y}$:

$$
\mathbf{s}_Y(\omega) = (2\pi)^{-1}\mathbf{G}_Y(e^{-i\omega}) = (2\pi)^{-1}\sum_{k=-\infty}^{\infty}\boldsymbol{\Gamma}_k e^{-i\omega k}.
\tag{10.4.3}
$$

The population spectrum associates an $(n \times n)$ matrix of complex numbers with
the real scalar $\omega$.

Identical calculations to those used to establish Proposition 6.1 indicate that
when any element of $\mathbf{s}_Y(\omega)$ is multiplied by $e^{i\omega k}$ and the resulting function of $\omega$ is
integrated from $-\pi$ to $\pi$, the result is the corresponding element of the $k$th
autocovariance matrix of $\mathbf{y}$:

$$
\int_{-\pi}^{\pi}\mathbf{s}_Y(\omega)\,e^{i\omega k}\,d\omega = \boldsymbol{\Gamma}_k.
\tag{10.4.4}
$$

Thus, as in the univariate case, the sequence of autocovariances $\{\boldsymbol{\Gamma}_k\}_{k=-\infty}^{\infty}$ and the
function represented by the population spectrum $\mathbf{s}_Y(\omega)$ contain the identical
information.

As a special case, when $k = 0$, equation [10.4.4] implies

$$
\int_{-\pi}^{\pi}\mathbf{s}_Y(\omega)\,d\omega = \boldsymbol{\Gamma}_0.
\tag{10.4.5}
$$

In other words, the area under the population spectrum is the unconditional
variance-covariance matrix of $\mathbf{y}$.
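Results [10.4.4] and [10.4.5] can be illustrated for a process whose spectrum is available in closed form. The sketch below assumes NumPy and a bivariate MA(1), $\mathbf{y}_t = \boldsymbol{\varepsilon}_t + \boldsymbol{\Theta}\boldsymbol{\varepsilon}_{t-1}$, with made-up values for `Theta` and `Omega`; the function name `ma1_spectrum` is illustrative. The spectrum is built from [10.3.4] and [10.4.3], and the integrals are approximated by an equally spaced Riemann sum over one period.

```python
import numpy as np

def ma1_spectrum(Theta, Omega, w):
    """s_Y(w) = (2 pi)^{-1} Psi(z) Omega Psi(1/z)' at z = exp(-i w), Psi(z) = I + Theta z."""
    n = Omega.shape[0]
    z = np.exp(-1j * w)
    Psi_z = np.eye(n) + Theta * z
    Psi_zinv_T = (np.eye(n) + Theta / z).T      # plain transpose, no conjugation
    return Psi_z @ Omega @ Psi_zinv_T / (2 * np.pi)

# Illustrative (made-up) parameters.
Theta = np.array([[0.5, 0.2], [0.0, 0.3]])
Omega = np.array([[1.0, 0.3], [0.3, 2.0]])

N = 64
w_grid = -np.pi + 2 * np.pi * np.arange(N) / N
S = np.array([ma1_spectrum(Theta, Omega, w) for w in w_grid])   # shape (N, 2, 2)

# Riemann sums for the integrals in [10.4.5] and [10.4.4] with k = 1.
Gamma0_num = (2 * np.pi / N) * S.sum(axis=0)
Gamma1_num = (2 * np.pi / N) * (S * np.exp(1j * w_grid)[:, None, None]).sum(axis=0)

S_at_1 = ma1_spectrum(Theta, Omega, 1.0)        # spectrum at one frequency
```

For this process $\boldsymbol{\Gamma}_0 = \boldsymbol{\Omega} + \boldsymbol{\Theta}\boldsymbol{\Omega}\boldsymbol{\Theta}'$ and $\boldsymbol{\Gamma}_1 = \boldsymbol{\Theta}\boldsymbol{\Omega}$ by [10.2.4] and [10.2.5], which is what the two sums should reproduce.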

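The $j$th diagonal element of $\boldsymbol{\Gamma}_k$ is $E(y_{jt} - \mu_j)(y_{j,t-k} - \mu_j)$, the $k$th
autocovariance of $y_{jt}$. Thus, the $j$th diagonal element of the multivariate spectrum $\mathbf{s}_Y(\omega)$
is just the univariate spectrum of the scalar $y_{jt}$. It follows from the properties of
the univariate spectrum discussed in Chapter 6 that the diagonal elements of $\mathbf{s}_Y(\omega)$
are real-valued and nonnegative for all $\omega$. However, the same is not true of the
off-diagonal elements of $\mathbf{s}_Y(\omega)$; in general, the off-diagonal elements of $\mathbf{s}_Y(\omega)$ will
be complex numbers.

To gain further understanding of the multivariate spectrum, we concentrate
on the case of $n = 2$ variables, denoted

$$
\mathbf{y}_t = \begin{bmatrix} X_t \\ Y_t \end{bmatrix}.
$$

The $k$th autocovariance matrix is then

$$
\begin{aligned}
\boldsymbol{\Gamma}_k &= E\begin{bmatrix}
(X_t - \mu_X)(X_{t-k} - \mu_X) & (X_t - \mu_X)(Y_{t-k} - \mu_Y) \\
(Y_t - \mu_Y)(X_{t-k} - \mu_X) & (Y_t - \mu_Y)(Y_{t-k} - \mu_Y)
\end{bmatrix} \\
&\equiv \begin{bmatrix} \gamma_{XX}^{(k)} & \gamma_{XY}^{(k)} \\ \gamma_{YX}^{(k)} & \gamma_{YY}^{(k)} \end{bmatrix}.
\end{aligned}
\tag{10.4.6}
$$

Recall from [10.2.2] that $\boldsymbol{\Gamma}_k' = \boldsymbol{\Gamma}_{-k}$. Hence,

$$
\gamma_{XX}^{(k)} = \gamma_{XX}^{(-k)}
\tag{10.4.7}
$$

$$
\gamma_{YY}^{(k)} = \gamma_{YY}^{(-k)}
\tag{10.4.8}
$$

$$
\gamma_{XY}^{(k)} = \gamma_{YX}^{(-k)}.
\tag{10.4.9}
$$

For this $n = 2$ case, the population spectrum [10.4.3] would be

$$
\begin{aligned}
\mathbf{s}_Y(\omega) &= \frac{1}{2\pi}
\begin{bmatrix}
\sum_{k=-\infty}^{\infty}\gamma_{XX}^{(k)}e^{-i\omega k} & \sum_{k=-\infty}^{\infty}\gamma_{XY}^{(k)}e^{-i\omega k} \\
\sum_{k=-\infty}^{\infty}\gamma_{YX}^{(k)}e^{-i\omega k} & \sum_{k=-\infty}^{\infty}\gamma_{YY}^{(k)}e^{-i\omega k}
\end{bmatrix} \\
&= \frac{1}{2\pi}
\begin{bmatrix}
\sum_{k=-\infty}^{\infty}\gamma_{XX}^{(k)}\{\cos(\omega k) - i\cdot\sin(\omega k)\} & \sum_{k=-\infty}^{\infty}\gamma_{XY}^{(k)}\{\cos(\omega k) - i\cdot\sin(\omega k)\} \\
\sum_{k=-\infty}^{\infty}\gamma_{YX}^{(k)}\{\cos(\omega k) - i\cdot\sin(\omega k)\} & \sum_{k=-\infty}^{\infty}\gamma_{YY}^{(k)}\{\cos(\omega k) - i\cdot\sin(\omega k)\}
\end{bmatrix}.
\end{aligned}
\tag{10.4.10}
$$

Using [10.4.7] and [10.4.8] along with the facts that $\sin(-\omega k) = -\sin(\omega k)$ and
$\sin(0) = 0$, the imaginary components disappear from the diagonal terms:

$$
\mathbf{s}_Y(\omega) = \frac{1}{2\pi}
\begin{bmatrix}
\sum_{k=-\infty}^{\infty}\gamma_{XX}^{(k)}\cos(\omega k) & \sum_{k=-\infty}^{\infty}\gamma_{XY}^{(k)}\{\cos(\omega k) - i\cdot\sin(\omega k)\} \\
\sum_{k=-\infty}^{\infty}\gamma_{YX}^{(k)}\{\cos(\omega k) - i\cdot\sin(\omega k)\} & \sum_{k=-\infty}^{\infty}\gamma_{YY}^{(k)}\cos(\omega k)
\end{bmatrix}.
\tag{10.4.11}
$$

However, since in general $\gamma_{YX}^{(k)} \neq \gamma_{YX}^{(-k)}$, the off-diagonal elements are typically
complex numbers.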


The Cross Spectrum, Cospectrum, and Quadrature Spectrum 


The lower left element of the matrix in [10.4.11] is known as the population
cross spectrum from $X$ to $Y$:

$$
s_{YX}(\omega) = (2\pi)^{-1}\sum_{k=-\infty}^{\infty}\gamma_{YX}^{(k)}\{\cos(\omega k) - i\cdot\sin(\omega k)\}.
\tag{10.4.12}
$$

The cross spectrum can be written in terms of its real and imaginary components
as

$$
s_{YX}(\omega) = c_{YX}(\omega) + i\cdot q_{YX}(\omega).
\tag{10.4.13}
$$

The real component of the cross spectrum is known as the cospectrum between $X$
and $Y$:

$$
c_{YX}(\omega) = (2\pi)^{-1}\sum_{k=-\infty}^{\infty}\gamma_{YX}^{(k)}\cos(\omega k).
\tag{10.4.14}
$$

One can verify from [10.4.9] and the fact that $\cos(-\omega k) = \cos(\omega k)$ that

$$
c_{YX}(\omega) = c_{XY}(\omega).
\tag{10.4.15}
$$

The imaginary component of the cross spectrum is known as the quadrature spectrum
from $X$ to $Y$:

$$
q_{YX}(\omega) = -(2\pi)^{-1}\sum_{k=-\infty}^{\infty}\gamma_{YX}^{(k)}\sin(\omega k).
\tag{10.4.16}
$$

One can verify from [10.4.9] and the fact that $\sin(-\omega k) = -\sin(\omega k)$ that the
quadrature spectrum from $Y$ to $X$ is the negative of the quadrature spectrum from
$X$ to $Y$:

$$
q_{YX}(\omega) = -q_{XY}(\omega).
$$

Recalling [10.4.13], these results imply that the off-diagonal elements of $\mathbf{s}_Y(\omega)$ are
complex conjugates of each other; in general, the row $j$, column $m$ element of
$\mathbf{s}_Y(\omega)$ is the complex conjugate of the row $m$, column $j$ element of $\mathbf{s}_Y(\omega)$.

Note that both $c_{YX}(\omega)$ and $q_{YX}(\omega)$ are real-valued periodic functions of $\omega$:

$$
c_{YX}(\omega + 2\pi j) = c_{YX}(\omega) \qquad \text{for } j = \pm 1, \pm 2, \ldots
$$

$$
q_{YX}(\omega + 2\pi j) = q_{YX}(\omega) \qquad \text{for } j = \pm 1, \pm 2, \ldots.
$$

It further follows from [10.4.14] that

$$
c_{YX}(-\omega) = c_{YX}(\omega),
$$

while [10.4.16] implies that

$$
q_{YX}(-\omega) = -q_{YX}(\omega).
\tag{10.4.17}
$$

Hence, the cospectrum and quadrature spectrum are fully specified by the values
they assume as $\omega$ ranges between 0 and $\pi$.

Result [10.4.5] implies that the cross spectrum integrates to the unconditional
covariance between $X$ and $Y$:

$$
\int_{-\pi}^{\pi}s_{YX}(\omega)\,d\omega = E(Y_t - \mu_Y)(X_t - \mu_X).
$$

Observe from [10.4.17] that the quadrature spectrum integrates to zero:

$$
\int_{-\pi}^{\pi}q_{YX}(\omega)\,d\omega = 0.
$$

Hence, the covariance between $X$ and $Y$ can be calculated from the area under
the cospectrum between $X$ and $Y$:

$$
\int_{-\pi}^{\pi}c_{YX}(\omega)\,d\omega = E(Y_t - \mu_Y)(X_t - \mu_X).
\tag{10.4.18}
$$

The cospectrum between $X$ and $Y$ at frequency $\omega$ can thus be interpreted as the
portion of the covariance between $X$ and $Y$ that is attributable to cycles with
frequency $\omega$. Since the covariance can be positive or negative, the cospectrum can
be positive or negative, and indeed, $c_{YX}(\omega)$ may be positive over some frequencies
and negative over others.




The Sample Multivariate Periodogram 


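To gain further understanding of the cospectrum and the quadrature spectrum,
let $y_1, y_2, \ldots, y_T$ and $x_1, x_2, \ldots, x_T$ denote samples of $T$ observations on
the two variables. If for illustration $T$ is odd, Proposition 6.2 indicates that the
value of $y_t$ can be expressed as

$$
y_t = \bar{y} + \sum_{j=1}^{M}\{\hat{\alpha}_j\cdot\cos[\omega_j(t - 1)] + \hat{\delta}_j\cdot\sin[\omega_j(t - 1)]\},
\tag{10.4.19}
$$

where $\bar{y}$ is the sample mean of $y$, $M = (T - 1)/2$, $\omega_j = 2\pi j/T$, and

$$
\hat{\alpha}_j = (2/T)\sum_{t=1}^{T}y_t\cdot\cos[\omega_j(t - 1)]
\tag{10.4.20}
$$

$$
\hat{\delta}_j = (2/T)\sum_{t=1}^{T}y_t\cdot\sin[\omega_j(t - 1)].
\tag{10.4.21}
$$

An analogous representation for $x_t$ is

$$
x_t = \bar{x} + \sum_{j=1}^{M}\{\hat{a}_j\cdot\cos[\omega_j(t - 1)] + \hat{d}_j\cdot\sin[\omega_j(t - 1)]\}
\tag{10.4.22}
$$

$$
\hat{a}_j = (2/T)\sum_{t=1}^{T}x_t\cdot\cos[\omega_j(t - 1)]
\tag{10.4.23}
$$

$$
\hat{d}_j = (2/T)\sum_{t=1}^{T}x_t\cdot\sin[\omega_j(t - 1)].
\tag{10.4.24}
$$

Recall from [6.2.11] that the periodic regressors in [10.4.19] all have sample mean
zero and are mutually orthogonal, while

$$
\sum_{t=1}^{T}\cos^2[\omega_j(t - 1)] = \sum_{t=1}^{T}\sin^2[\omega_j(t - 1)] = T/2.
\tag{10.4.25}
$$

Consider the sample covariance between $x$ and $y$:

$$
T^{-1}\sum_{t=1}^{T}(y_t - \bar{y})(x_t - \bar{x}).
\tag{10.4.26}
$$

Substituting [10.4.19] and [10.4.22] into [10.4.26] and exploiting the mutual
orthogonality of the periodic regressors reveal that

$$
\begin{aligned}
T^{-1}\sum_{t=1}^{T}(y_t - \bar{y})(x_t - \bar{x})
&= T^{-1}\sum_{t=1}^{T}\Biggl\{\sum_{j=1}^{M}\{\hat{\alpha}_j\cdot\cos[\omega_j(t - 1)] + \hat{\delta}_j\cdot\sin[\omega_j(t - 1)]\} \\
&\qquad\qquad \times \sum_{j=1}^{M}\{\hat{a}_j\cdot\cos[\omega_j(t - 1)] + \hat{d}_j\cdot\sin[\omega_j(t - 1)]\}\Biggr\} \\
&= T^{-1}\sum_{t=1}^{T}\Biggl\{\sum_{j=1}^{M}\{\hat{\alpha}_j\hat{a}_j\cdot\cos^2[\omega_j(t - 1)] + \hat{\delta}_j\hat{d}_j\cdot\sin^2[\omega_j(t - 1)]\}\Biggr\} \\
&= (1/2)\sum_{j=1}^{M}(\hat{\alpha}_j\hat{a}_j + \hat{\delta}_j\hat{d}_j).
\end{aligned}
\tag{10.4.27}
$$

Hence, the portion of the sample covariance between $x$ and $y$ that is due to their
common dependence on cycles of frequency $\omega_j$ is given by

$$
(1/2)(\hat{\alpha}_j\hat{a}_j + \hat{\delta}_j\hat{d}_j).
\tag{10.4.28}
$$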


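This magnitude can be related to the sample analog of the cospectrum with
calculations similar to those used to establish result (c) of Proposition 6.2. Recall
that since

$$\sum_{t=1}^{T}\cos[\omega_j(t-1)] = 0,$$

the magnitude $\hat{\alpha}_j$ in [10.4.20] can alternatively be expressed as

$$\hat{\alpha}_j = (2/T)\sum_{t=1}^{T}(y_t - \bar{y})\cos[\omega_j(t-1)].$$

Thus,

$$\begin{aligned}
(\hat{a}_j + i\hat{d}_j)(\hat{\alpha}_j - i\hat{\delta}_j)
&= (4/T^2)\Big\{\sum_{t=1}^{T}(x_t - \bar{x})\cos[\omega_j(t-1)] + i\sum_{t=1}^{T}(x_t - \bar{x})\sin[\omega_j(t-1)]\Big\} \\
&\qquad \times \Big\{\sum_{\tau=1}^{T}(y_\tau - \bar{y})\cos[\omega_j(\tau-1)] - i\sum_{\tau=1}^{T}(y_\tau - \bar{y})\sin[\omega_j(\tau-1)]\Big\} \\
&= (4/T^2)\Big\{\sum_{t=1}^{T}(x_t - \bar{x})\exp[i\omega_j(t-1)]\Big\}\Big\{\sum_{\tau=1}^{T}(y_\tau - \bar{y})\exp[-i\omega_j(\tau-1)]\Big\} \\
&= (4/T^2)\Big\{\sum_{t=1}^{T}(x_t - \bar{x})(y_t - \bar{y}) + \sum_{t=1}^{T-1}(x_t - \bar{x})(y_{t+1} - \bar{y})\exp[-i\omega_j] \\
&\qquad + \sum_{t=2}^{T}(x_t - \bar{x})(y_{t-1} - \bar{y})\exp[i\omega_j] + \sum_{t=1}^{T-2}(x_t - \bar{x})(y_{t+2} - \bar{y})\exp[-2i\omega_j] \\
&\qquad + \sum_{t=3}^{T}(x_t - \bar{x})(y_{t-2} - \bar{y})\exp[2i\omega_j] + \cdots + (x_1 - \bar{x})(y_T - \bar{y})\exp[-(T-1)i\omega_j] \\
&\qquad + (x_T - \bar{x})(y_1 - \bar{y})\exp[(T-1)i\omega_j]\Big\} \\
&= (4/T)\Big\{\hat{\gamma}_{yx}^{(0)} + \hat{\gamma}_{yx}^{(1)}\exp[-i\omega_j] + \hat{\gamma}_{yx}^{(-1)}\exp[i\omega_j] \\
&\qquad + \hat{\gamma}_{yx}^{(2)}\exp[-2i\omega_j] + \hat{\gamma}_{yx}^{(-2)}\exp[2i\omega_j] + \cdots \\
&\qquad + \hat{\gamma}_{yx}^{(T-1)}\exp[-(T-1)i\omega_j] + \hat{\gamma}_{yx}^{(-T+1)}\exp[(T-1)i\omega_j]\Big\},
\end{aligned} \qquad [10.4.29]$$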


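where $\hat{\gamma}_{yx}^{(k)}$ is the sample covariance between the value of y and the value that x
assumed k periods earlier:

$$\hat{\gamma}_{yx}^{(k)} = \begin{cases}
(1/T)\sum_{t=1}^{T-k}(x_t - \bar{x})(y_{t+k} - \bar{y}) & \text{for } k = 0, 1, 2, \ldots, T-1 \\
(1/T)\sum_{t=-k+1}^{T}(x_t - \bar{x})(y_{t+k} - \bar{y}) & \text{for } k = -1, -2, \ldots, -T+1.
\end{cases} \qquad [10.4.30]$$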




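Result [10.4.29] implies that

$$\tfrac{1}{2}(\hat{a}_j + i\hat{d}_j)(\hat{\alpha}_j - i\hat{\delta}_j) = (2/T)\sum_{k=-T+1}^{T-1}\hat{\gamma}_{yx}^{(k)}\exp[-ki\omega_j] = (4\pi/T)\cdot\hat{s}_{yx}(\omega_j), \qquad [10.4.31]$$

where $\hat{s}_{yx}(\omega_j)$ is the sample cross periodogram from x to y at frequency $\omega_j$, or the
lower left element of the sample multivariate periodogram:

$$\hat{s}_y(\omega) = (2\pi)^{-1}\begin{bmatrix}
\sum_{k=-T+1}^{T-1}\hat{\gamma}_{xx}^{(k)}e^{-i\omega k} & \sum_{k=-T+1}^{T-1}\hat{\gamma}_{xy}^{(k)}e^{-i\omega k} \\
\sum_{k=-T+1}^{T-1}\hat{\gamma}_{yx}^{(k)}e^{-i\omega k} & \sum_{k=-T+1}^{T-1}\hat{\gamma}_{yy}^{(k)}e^{-i\omega k}
\end{bmatrix} = \begin{bmatrix}
\hat{s}_{xx}(\omega) & \hat{s}_{xy}(\omega) \\
\hat{s}_{yx}(\omega) & \hat{s}_{yy}(\omega)
\end{bmatrix}.$$

Expression [10.4.31] states that the sample cross periodogram from x to y at
frequency $\omega_j$ can be expressed as

$$\begin{aligned}
\hat{s}_{yx}(\omega_j) &= [T/(8\pi)]\cdot(\hat{a}_j + i\hat{d}_j)(\hat{\alpha}_j - i\hat{\delta}_j) \\
&= [T/(8\pi)]\cdot(\hat{a}_j\hat{\alpha}_j + \hat{d}_j\hat{\delta}_j) + i\cdot[T/(8\pi)]\cdot(\hat{d}_j\hat{\alpha}_j - \hat{a}_j\hat{\delta}_j).
\end{aligned}$$

The real component is the sample analog of the cospectrum, while the imaginary
component is the sample analog of the quadrature spectrum:

$$\hat{s}_{yx}(\omega_j) = \hat{c}_{yx}(\omega_j) + i\cdot\hat{q}_{yx}(\omega_j), \qquad [10.4.32]$$

where

$$\hat{c}_{yx}(\omega_j) = [T/(8\pi)]\cdot(\hat{a}_j\hat{\alpha}_j + \hat{d}_j\hat{\delta}_j) \qquad [10.4.33]$$

$$\hat{q}_{yx}(\omega_j) = [T/(8\pi)]\cdot(\hat{d}_j\hat{\alpha}_j - \hat{a}_j\hat{\delta}_j). \qquad [10.4.34]$$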


Comparing [10.4.33] with [10.4.28], the sample cospectrum evaluated at $\omega_j$
is proportional to the portion of the sample covariance between y and x that is
attributable to cycles with frequency $\omega_j$. The population cospectrum admits an
analogous interpretation as the portion of the population covariance between Y
and X attributable to cycles with frequency $\omega$, based on a multivariate version of
the spectral representation theorem.

What interpretation are we to attach to the quadrature spectrum? Consider
using the weights in [10.4.22] to construct a new series $x_t^*$ by shifting the phase of
each of the periodic functions by a quarter cycle:

$$x_t^* = \bar{x} + \sum_{j=1}^{M}\{\hat{a}_j\cos[\omega_j(t-1) + (\pi/2)] + \hat{d}_j\sin[\omega_j(t-1) + (\pi/2)]\}. \qquad [10.4.35]$$

The variable $x_t^*$ is driven by the same cycles as $x_t$, except that at date t = 1 each
cycle is one-quarter of the way through rather than just beginning as in the case
of $x_t$.

Since $\sin[\theta + (\pi/2)] = \cos(\theta)$ and since $\cos[\theta + (\pi/2)] = -\sin(\theta)$, the variable
$x_t^*$ can alternatively be described as

$$x_t^* = \bar{x} + \sum_{j=1}^{M}\{\hat{d}_j\cos[\omega_j(t-1)] - \hat{a}_j\sin[\omega_j(t-1)]\}. \qquad [10.4.36]$$




As in [10.4.27], the sample covariance between $y_t$ and $x_t^*$ is found to be

$$T^{-1}\sum_{t=1}^{T}(y_t - \bar{y})(x_t^* - \bar{x}) = (1/2)\sum_{j=1}^{M}(\hat{\alpha}_j\hat{d}_j - \hat{\delta}_j\hat{a}_j).$$

Comparing this with [10.4.34], the sample quadrature spectrum from x to y at
frequency $\omega_j$ is proportional to the portion of the sample covariance between $x^*$
and y that is due to cycles of frequency $\omega_j$. Cycles of frequency $\omega_j$ may be quite
important for both x and y individually (as reflected by large values for $\hat{s}_{xx}(\omega)$ and
$\hat{s}_{yy}(\omega)$) yet fail to produce much contemporaneous covariance between the variables
because at any given date the two series are in a different phase of the cycle. For
example, the variable x may respond to an economic recession sooner than y. The
quadrature spectrum looks for evidence of such out-of-phase cycles.
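The decomposition above can be checked numerically. The following is a minimal sketch, assuming NumPy and simulated data; the helper name `fourier_coefficients` and the particular series are illustrative choices, not part of the text. It computes the coefficients in [10.4.20] through [10.4.24], the sample cospectrum and quadrature spectrum in [10.4.33] and [10.4.34], and the phase-shifted series in [10.4.36].

```python
import numpy as np

def fourier_coefficients(z):
    """Cosine and sine coefficients of z at w_j = 2*pi*j/T, j = 1..M (T odd)."""
    T = len(z)
    M = (T - 1) // 2
    t = np.arange(T)                          # plays the role of (t - 1)
    w = 2 * np.pi * np.arange(1, M + 1) / T
    cos_c = (2 / T) * np.cos(np.outer(w, t)) @ z
    sin_c = (2 / T) * np.sin(np.outer(w, t)) @ z
    return cos_c, sin_c, w

rng = np.random.default_rng(0)
T = 101                                       # odd, as assumed in the text
x = rng.standard_normal(T)
y = 0.5 * np.roll(x, 1) + rng.standard_normal(T)

alpha, delta, w = fourier_coefficients(y)     # [10.4.20], [10.4.21]
a, d, _ = fourier_coefficients(x)             # [10.4.23], [10.4.24]

cospec = T / (8 * np.pi) * (a * alpha + d * delta)   # [10.4.33]
quad = T / (8 * np.pi) * (d * alpha - a * delta)     # [10.4.34]

# Sample covariance [10.4.26] and its frequency-by-frequency pieces [10.4.28]
sample_cov = np.mean((y - y.mean()) * (x - x.mean()))
by_frequency = 0.5 * (alpha * a + delta * d)

# Phase-shifted series [10.4.36] and its sample covariance with y
t = np.arange(T)
x_star = x.mean() + np.cos(np.outer(t, w)) @ d - np.sin(np.outer(t, w)) @ a
cov_y_xstar = np.mean((y - y.mean()) * (x_star - x.mean()))
```

Under these assumptions, `by_frequency.sum()` reproduces `sample_cov`, and `(4 * np.pi / T) * quad.sum()` reproduces `cov_y_xstar`, which is the sense in which the quadrature spectrum measures covariance between y and the quarter-cycle-shifted x.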


Coherence, Phase, and Gain 


The population coherence between X and Y is a measure of the degree to
which X and Y are jointly influenced by cycles of frequency $\omega$. This measure
combines the inferences of the cospectrum and the quadrature spectrum, and is
defined as¹

$$h_{YX}(\omega) = \frac{[c_{YX}(\omega)]^2 + [q_{YX}(\omega)]^2}{s_{YY}(\omega)\,s_{XX}(\omega)},$$

assuming that $s_{YY}(\omega)$ and $s_{XX}(\omega)$ are nonzero. If $s_{YY}(\omega)$ or $s_{XX}(\omega)$ is zero, the
coherence is defined to be zero. It can be shown that $0 \le h_{YX}(\omega) \le 1$ for all $\omega$ as
long as X and Y are covariance-stationary with absolutely summable autocovariance
matrices.² If $h_{YX}(\omega)$ is large, this indicates that Y and X have important cycles of
frequency $\omega$ in common.

The cospectrum and quadrature spectrum can alternatively be described in
polar coordinate form. In this notation, the population cross spectrum from X to
Y is written as

$$s_{YX}(\omega) = c_{YX}(\omega) + i\cdot q_{YX}(\omega) = R(\omega)\cdot\exp[i\cdot\theta(\omega)], \qquad [10.4.37]$$

where

$$R(\omega) = \{[c_{YX}(\omega)]^2 + [q_{YX}(\omega)]^2\}^{1/2} \qquad [10.4.38]$$

and $\theta(\omega)$ represents the radian angle satisfying

$$\frac{\sin[\theta(\omega)]}{\cos[\theta(\omega)]} = \frac{q_{YX}(\omega)}{c_{YX}(\omega)}. \qquad [10.4.39]$$

The function $R(\omega)$ is sometimes described as the gain while $\theta(\omega)$ is called the
phase.³
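As a small illustration of these definitions, the sketch below (assuming NumPy; the function name `coherence_gain_phase` and the numerical inputs are hypothetical) evaluates the coherence, $R(\omega)$, and $\theta(\omega)$ at a single frequency. The angle is taken from `arctan2`, which selects the quadrant consistent with both [10.4.37] and [10.4.39]. The second part uses the periodogram expressions [10.4.33] and [10.4.34], together with the analogous expressions $[T/(8\pi)](\hat{a}_j^2 + \hat{d}_j^2)$ and $[T/(8\pi)](\hat{\alpha}_j^2 + \hat{\delta}_j^2)$ for the own periodograms, to show why the unsmoothed sample coherence equals 1.

```python
import numpy as np

def coherence_gain_phase(c, q, s_yy, s_xx):
    """Coherence, R(w) as in [10.4.38], and an angle theta(w) satisfying [10.4.39]."""
    gain = np.hypot(c, q)
    phase = np.arctan2(q, c)
    denom = s_yy * s_xx
    coherence = (c**2 + q**2) / denom if denom != 0 else 0.0
    return coherence, gain, phase

# Population-style example with made-up spectral values at one frequency
coh, gain, phase = coherence_gain_phase(c=1.0, q=1.0, s_yy=2.0, s_xx=2.0)

# Unsmoothed periodogram ordinates built from arbitrary Fourier coefficients
a, d, alpha, delta, T = 0.3, -1.2, 0.7, 0.4, 101
K = T / (8 * np.pi)
c_hat = K * (a * alpha + d * delta)
q_hat = K * (d * alpha - a * delta)
s_xx_hat = K * (a**2 + d**2)
s_yy_hat = K * (alpha**2 + delta**2)
sample_coh, _, _ = coherence_gain_phase(c_hat, q_hat, s_yy_hat, s_xx_hat)
```

Because $(a\alpha + d\delta)^2 + (d\alpha - a\delta)^2 = (a^2 + d^2)(\alpha^2 + \delta^2)$, `sample_coh` equals 1 for any coefficient values, which is why some smoothing is needed before the sample coherence is informative.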


¹The coherence is sometimes alternatively defined as the square root of this magnitude. The sample
coherence based on the unsmoothed periodogram is identically equal to 1.

²See, for example, Fuller (1976, p. 156).

³The gain is sometimes alternatively defined as $R(\omega)/s_{XX}(\omega)$.




The Population Spectrum for Vector MA and AR Processes 


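Let $y_t$ be a vector MA(∞) process with absolutely summable moving average
coefficients:

$$y_t = \mu + \Psi(L)\varepsilon_t,$$

where

$$E(\varepsilon_t\varepsilon_\tau') = \begin{cases} \Omega & \text{for } t = \tau \\ 0 & \text{otherwise.} \end{cases}$$

Substituting [10.3.4] into [10.4.3] reveals that the population spectrum for $y_t$ can
be calculated as

$$s_Y(\omega) = (2\pi)^{-1}[\Psi(e^{-i\omega})]\,\Omega\,[\Psi(e^{i\omega})]'. \qquad [10.4.40]$$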


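For example, the population spectrum for a stationary VAR(p) as written in [10.1.4]
is

$$\begin{aligned}
s_Y(\omega) = (2\pi)^{-1}&\{I_n - \Phi_1 e^{-i\omega} - \Phi_2 e^{-2i\omega} - \cdots - \Phi_p e^{-pi\omega}\}^{-1}\,\Omega \\
&\times \{I_n - \Phi_1' e^{i\omega} - \Phi_2' e^{2i\omega} - \cdots - \Phi_p' e^{pi\omega}\}^{-1}.
\end{aligned} \qquad [10.4.41]$$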
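Expression [10.4.41] is straightforward to evaluate numerically. The sketch below assumes NumPy; the function name `var_spectrum` and the parameter values are illustrative only. It uses the fact that, for real coefficient matrices, the second factor in [10.4.41] is the conjugate transpose of the first.

```python
import numpy as np

def var_spectrum(phis, omega_mat, w):
    """Population spectrum [10.4.41] of a stationary VAR(p) at frequency w.

    phis is a list [Phi_1, ..., Phi_p] of real (n x n) arrays and
    omega_mat is the (n x n) innovation variance matrix.
    """
    n = omega_mat.shape[0]
    A = np.eye(n, dtype=complex)
    for j, phi in enumerate(phis, start=1):
        A -= phi * np.exp(-1j * j * w)
    A_inv = np.linalg.inv(A)
    # {I - Phi_1' e^{iw} - ...}^{-1} is the conjugate transpose of A_inv
    return A_inv @ omega_mat @ A_inv.conj().T / (2 * np.pi)

# Scalar AR(1) check: s(w) = sigma^2 / (2*pi*(1 + phi^2 - 2*phi*cos(w)))
s_ar1 = var_spectrum([np.array([[0.5]])], np.array([[1.0]]), 0.3)

# Bivariate VAR(1) with made-up coefficients
Phi1 = np.array([[0.5, 0.1], [0.2, 0.3]])
Omega = np.array([[1.0, 0.3], [0.3, 2.0]])
s_var = var_spectrum([Phi1], Omega, 1.0)
```

The diagonal elements of `s_var` are the real-valued spectra of the individual series, and the lower left element is the cross spectrum whose real and imaginary parts give the cospectrum and quadrature spectrum.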

Estimating the Population Spectrum 

If an observed time series $y_1, y_2, \ldots, y_T$ can be reasonably described by a
pth-order vector autoregression, one good approach to estimating the population
spectrum is to estimate the parameters of the vector autoregression [10.1.4] by
OLS and then substitute these parameter estimates into equation [10.4.41].

Alternatively, the sample cross periodogram from x to y at frequency $\omega_j = 2\pi j/T$
can be calculated from [10.4.32] to [10.4.34], where $\hat{\alpha}_j$, $\hat{\delta}_j$, $\hat{a}_j$, and $\hat{d}_j$ are as
defined in [10.4.20] through [10.4.24]. One would want to smooth these to obtain
a more useful estimate of the population cross spectrum. For example, one
reasonable estimate of the population cospectrum between X and Y at frequency $\omega_j$
would be

$$\hat{c}_{YX}(\omega_j) = \sum_{m=-h}^{h}\left\{\frac{h + 1 - |m|}{(h+1)^2}\right\}\hat{c}_{yx}(\omega_{j+m}),$$

where $\hat{c}_{yx}(\omega_{j+m})$ denotes the estimate in [10.4.33] evaluated at frequency
$\omega_{j+m} = 2\pi(j+m)/T$ and h is a bandwidth parameter reflecting how many different
frequencies are to be used in estimating the cospectrum at frequency $\omega_j$.

Another approach is to express the smoothing in terms of weighting
coefficients $\kappa_k^*$ to be applied to $\hat{\Gamma}_k$ when the population autocovariances in expression
[10.4.3] are replaced by sample autocovariances. Such an estimate would take the
form

$$\hat{s}_Y(\omega) = (2\pi)^{-1}\Big\{\hat{\Gamma}_0 + \sum_{k=1}^{T-1}\kappa_k^*[\hat{\Gamma}_k e^{-i\omega k} + \hat{\Gamma}_k' e^{i\omega k}]\Big\},$$




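where

$$\hat{\Gamma}_k = T^{-1}\sum_{t=k+1}^{T}(y_t - \bar{y})(y_{t-k} - \bar{y})'$$

$$\bar{y} = T^{-1}\sum_{t=1}^{T} y_t.$$

For example, the modified Bartlett estimate of the multivariate spectrum is

$$\hat{s}_Y(\omega) = (2\pi)^{-1}\Big\{\hat{\Gamma}_0 + \sum_{k=1}^{q}\Big[1 - \frac{k}{q+1}\Big](\hat{\Gamma}_k e^{-i\omega k} + \hat{\Gamma}_k' e^{i\omega k})\Big\}. \qquad [10.4.42]$$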


Filters 


Let $x_t$ be an r-dimensional covariance-stationary process with absolutely
summable autocovariances and with (r × r) population spectrum denoted $s_X(\omega)$.
Let $\{H_k\}_{k=-\infty}^{\infty}$ be an absolutely summable sequence of (n × r) matrices, and let
$y_t$ denote the n-dimensional vector process given by

$$y_t = H(L)x_t = \sum_{k=-\infty}^{\infty} H_k x_{t-k}.$$

It follows from [10.3.9] that the population spectrum of y (denoted $s_Y(\omega)$) is related
to that of x according to

$$\underset{(n \times n)}{s_Y(\omega)} = \underset{(n \times r)}{[H(e^{-i\omega})]}\;\underset{(r \times r)}{s_X(\omega)}\;\underset{(r \times n)}{[H(e^{i\omega})]'}. \qquad [10.4.43]$$
As a special case of this result, let $X_t$ be a univariate stationary stochastic
process with continuous spectrum $s_{XX}(\omega)$, and let $u_t$ be a second univariate stationary
stochastic process with continuous spectrum $s_{uu}(\omega)$, where $X_t$ and $u_\tau$ are
uncorrelated for all t and $\tau$. Thus, the population spectrum of the vector $x_t = (X_t, u_t)'$ is
given by

$$s_X(\omega) = \begin{bmatrix} s_{XX}(\omega) & 0 \\ 0 & s_{uu}(\omega) \end{bmatrix}.$$


Define a new series $Y_t$ according to

$$Y_t = \sum_{k=-\infty}^{\infty} h_k X_{t-k} + u_t = h(L)X_t + u_t, \qquad [10.4.44]$$

where $\{h_k\}_{k=-\infty}^{\infty}$ is absolutely summable. Note that the vector $y_t = (X_t, Y_t)'$ is
obtained from the original vector $x_t$ by the filter

$$y_t = H(L)x_t,$$

where

$$H(L) = \begin{bmatrix} 1 & 0 \\ h(L) & 1 \end{bmatrix}.$$


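It follows from [10.4.43] that the spectrum of y is given by

$$\begin{aligned}
s_Y(\omega) &= \begin{bmatrix} 1 & 0 \\ h(e^{-i\omega}) & 1 \end{bmatrix}
\begin{bmatrix} s_{XX}(\omega) & 0 \\ 0 & s_{uu}(\omega) \end{bmatrix}
\begin{bmatrix} 1 & h(e^{i\omega}) \\ 0 & 1 \end{bmatrix} \\
&= \begin{bmatrix} s_{XX}(\omega) & s_{XX}(\omega)h(e^{i\omega}) \\ h(e^{-i\omega})s_{XX}(\omega) & h(e^{-i\omega})s_{XX}(\omega)h(e^{i\omega}) + s_{uu}(\omega) \end{bmatrix},
\end{aligned} \qquad [10.4.45]$$

where

$$h(e^{-i\omega}) = \sum_{k=-\infty}^{\infty} h_k e^{-i\omega k}. \qquad [10.4.46]$$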


The lower left element of the matrix in [10.4.45] indicates that when $Y_t$ and $X_t$ are
related according to [10.4.44], the cross spectrum from X to Y can be calculated
by multiplying [10.4.46] by the spectrum of X.

We can also imagine going through these steps in reverse order. Specifically,
suppose we are given an observed vector $y_t = (X_t, Y_t)'$ with absolutely summable
autocovariance matrices and with population spectrum given by

$$s_Y(\omega) = \begin{bmatrix} s_{XX}(\omega) & s_{XY}(\omega) \\ s_{YX}(\omega) & s_{YY}(\omega) \end{bmatrix}. \qquad [10.4.47]$$


Then the linear projection of $Y_t$ on $\{X_{t-k}\}_{k=-\infty}^{\infty}$ exists and is of the form of [10.4.44],
where $u_t$ would now be regarded as the population residual associated with the
linear projection. The sequence of linear projection coefficients $\{h_k\}_{k=-\infty}^{\infty}$ can be
summarized in terms of the function of $\omega$ given in [10.4.46]. Comparing the lower
left elements of [10.4.47] and [10.4.45], this function must satisfy

$$h(e^{-i\omega})s_{XX}(\omega) = s_{YX}(\omega).$$

In other words, the function $h(e^{-i\omega})$ can be calculated from

$$h(e^{-i\omega}) = \frac{s_{YX}(\omega)}{s_{XX}(\omega)}, \qquad [10.4.48]$$

assuming that $s_{XX}(\omega)$ is not zero. When $s_{XX}(\omega) = 0$, we set $h(e^{-i\omega}) = 0$. This
magnitude, the ratio of the cross spectrum from X to Y to the spectrum of X, is
known as the transfer function from X to Y.

The principles underlying [10.4.4] can further be used to uncover individual
transfer function coefficients:

$$h_k = (2\pi)^{-1}\int_{-\pi}^{\pi} h(e^{-i\omega})e^{i\omega k}\,d\omega.$$

In other words, given an observed vector $(X_t, Y_t)'$ with absolutely summable
autocovariance matrices and thus with continuous population spectrum of the form
of [10.4.47], the coefficient on $X_{t-k}$ in the population linear projection of $Y_t$ on
$\{X_{t-k}\}_{k=-\infty}^{\infty}$ can be calculated from

$$h_k = (2\pi)^{-1}\int_{-\pi}^{\pi}\frac{s_{YX}(\omega)}{s_{XX}(\omega)}e^{i\omega k}\,d\omega. \qquad [10.4.49]$$
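To make [10.4.49] concrete, the following sketch (assuming NumPy; the function `projection_coefficient`, the grid size, and the example filter are all hypothetical choices) approximates the integral by averaging the integrand over an evenly spaced grid on $[-\pi, \pi)$. The example takes $X_t$ to be white noise with variance 2 and $Y_t = 0.7X_t - 0.3X_{t-2} + u_t$, so that by [10.4.45] the cross spectrum is $h(e^{-i\omega})s_{XX}(\omega)$ with $h(e^{-i\omega}) = 0.7 - 0.3e^{-2i\omega}$.

```python
import numpy as np

def projection_coefficient(s_yx, s_xx, k, n_grid=4096):
    """Approximate h_k in [10.4.49] by a grid average over [-pi, pi)."""
    w = -np.pi + 2 * np.pi * np.arange(n_grid) / n_grid
    integrand = s_yx(w) / s_xx(w) * np.exp(1j * w * k)
    # (2*pi)^{-1} times the integral is approximated by the mean over the grid
    return integrand.mean().real

def s_xx(w):
    # Spectrum of white noise with variance 2
    return np.full_like(w, 2.0 / (2 * np.pi))

def s_yx(w):
    # Lower left element of [10.4.45] for the example filter
    return (0.7 - 0.3 * np.exp(-2j * w)) * s_xx(w)

h = [projection_coefficient(s_yx, s_xx, k) for k in range(-1, 4)]
```

Under these assumptions the list `h` holds approximations to $h_{-1}, h_0, h_1, h_2, h_3$, which should be close to 0, 0.7, 0, -0.3, and 0. In practice the population spectra would be replaced by smoothed estimates.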




10.5. The Sample Mean of a Vector Process 


Variance of the Sample Mean 


Suppose we have a sample of size T, $\{y_1, y_2, \ldots, y_T\}$, drawn from an
n-dimensional covariance-stationary process with

$$E(y_t) = \mu \qquad [10.5.1]$$

$$E[(y_t - \mu)(y_{t-j} - \mu)'] = \Gamma_j. \qquad [10.5.2]$$

Consider the properties of the sample mean,

$$\bar{y}_T = (1/T)\sum_{t=1}^{T} y_t. \qquad [10.5.3]$$


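As in the discussion in Section 7.2 of the sample mean of a scalar process, it is
clear that $E(\bar{y}_T) = \mu$ and

$$\begin{aligned}
E[(\bar{y}_T - \mu)(\bar{y}_T - \mu)']
&= (1/T^2)E\{(y_1 - \mu)[(y_1 - \mu)' + (y_2 - \mu)' + \cdots + (y_T - \mu)'] \\
&\qquad + (y_2 - \mu)[(y_1 - \mu)' + (y_2 - \mu)' + \cdots + (y_T - \mu)'] \\
&\qquad + (y_3 - \mu)[(y_1 - \mu)' + (y_2 - \mu)' + \cdots + (y_T - \mu)'] \\
&\qquad + \cdots + (y_T - \mu)[(y_1 - \mu)' + (y_2 - \mu)' + \cdots + (y_T - \mu)']\} \\
&= (1/T^2)\{[\Gamma_0 + \Gamma_{-1} + \cdots + \Gamma_{-(T-1)}] \\
&\qquad + [\Gamma_1 + \Gamma_0 + \Gamma_{-1} + \cdots + \Gamma_{-(T-2)}] \\
&\qquad + [\Gamma_2 + \Gamma_1 + \Gamma_0 + \Gamma_{-1} + \cdots + \Gamma_{-(T-3)}] \\
&\qquad + \cdots + [\Gamma_{T-1} + \Gamma_{T-2} + \Gamma_{T-3} + \cdots + \Gamma_0]\} \\
&= (1/T^2)\{T\Gamma_0 + (T-1)\Gamma_1 + (T-2)\Gamma_2 + \cdots + \Gamma_{T-1} \\
&\qquad + (T-1)\Gamma_{-1} + (T-2)\Gamma_{-2} + \cdots + \Gamma_{-(T-1)}\}.
\end{aligned} \qquad [10.5.4]$$

Thus,

$$\begin{aligned}
T\cdot E[(\bar{y}_T - \mu)(\bar{y}_T - \mu)']
&= \Gamma_0 + [(T-1)/T]\Gamma_1 + [(T-2)/T]\Gamma_2 + \cdots + [1/T]\Gamma_{T-1} \\
&\qquad + [(T-1)/T]\Gamma_{-1} + [(T-2)/T]\Gamma_{-2} + \cdots + [1/T]\Gamma_{-(T-1)}.
\end{aligned} \qquad [10.5.5]$$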
As in the univariate case, the weights on $\Gamma_k$ for |k| small go to unity as $T \to \infty$,
and higher autocovariances go to zero for a covariance-stationary process. Hence,
we have the following generalization of Proposition 7.5.

Proposition 10.5: Let $y_t$ be a covariance-stationary process with moments given by
[10.5.1] and [10.5.2] and with absolutely summable autocovariances. Then the
sample mean [10.5.3] satisfies

(a) $\bar{y}_T \xrightarrow{p} \mu$
(b) $\lim_{T\to\infty}\{T\cdot E[(\bar{y}_T - \mu)(\bar{y}_T - \mu)']\} = \sum_{v=-\infty}^{\infty}\Gamma_v$.




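The proof of Proposition 10.5 is virtually identical to that of Proposition 7.5.
Consider the following (n × n) matrix:

$$\sum_{v=-\infty}^{\infty}\Gamma_v - T\cdot E[(\bar{y}_T - \mu)(\bar{y}_T - \mu)'] = \sum_{|v| \ge T}\Gamma_v + \sum_{v=-(T-1)}^{T-1}(|v|/T)\Gamma_v, \qquad [10.5.6]$$

where the equality follows from [10.5.5]. Let $\gamma_{ij}^{(v)}$ denote the row i, column j element
of $\Gamma_v$. The row i, column j element of the matrix in [10.5.6] can then be written

$$\sum_{|v| \ge T}\gamma_{ij}^{(v)} + \sum_{v=-(T-1)}^{T-1}(|v|/T)\gamma_{ij}^{(v)}.$$

Absolute summability of $\{\Gamma_v\}_{v=-\infty}^{\infty}$ implies that for any $\varepsilon > 0$ there exists a q
such that

$$\sum_{|v| > q}|\gamma_{ij}^{(v)}| < \varepsilon/2.$$

Thus,

$$\Big|\sum_{|v| \ge T}\gamma_{ij}^{(v)} + \sum_{v=-(T-1)}^{T-1}(|v|/T)\gamma_{ij}^{(v)}\Big| \le \varepsilon/2 + \sum_{v=-q}^{q}(|v|/T)|\gamma_{ij}^{(v)}|.$$

This sum can be made less than $\varepsilon$ by choosing T sufficiently large. This establishes
claim (b) of Proposition 10.5. From this result, $E(\bar{y}_{i,T} - \mu_i)^2 \to 0$ for each i,
implying that $\bar{y}_{i,T} \xrightarrow{p} \mu_i$.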
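Claim (b) can be illustrated with a process whose autocovariances are known in closed form. The sketch below assumes NumPy and a vector MA(1) process $y_t = \varepsilon_t + \Theta\varepsilon_{t-1}$ with made-up values for $\Theta$ and $\Omega$; the function name `t_times_var_of_mean` is hypothetical. It evaluates the finite-sample expression [10.5.5] and compares it with the limit $\sum_v \Gamma_v$.

```python
import numpy as np

def t_times_var_of_mean(gamma, T):
    """Evaluate [10.5.5]: sum over |v| < T of (1 - |v|/T) * Gamma_v,
    using Gamma_{-v} = Gamma_v'."""
    total = gamma(0).astype(float)
    for v in range(1, T):
        g = gamma(v)
        total = total + (1 - v / T) * (g + g.T)
    return total

Theta = np.array([[0.5, 0.1], [0.0, 0.3]])
Omega = np.array([[1.0, 0.2], [0.2, 2.0]])

def gamma(v):
    """Autocovariances of the vector MA(1) process, for v >= 0."""
    if v == 0:
        return Omega + Theta @ Omega @ Theta.T
    if v == 1:
        return Theta @ Omega
    return np.zeros((2, 2))

T = 2000
finite_T = t_times_var_of_mean(gamma, T)
I2 = np.eye(2)
limit = (I2 + Theta) @ Omega @ (I2 + Theta).T   # Gamma_0 + Gamma_1 + Gamma_1'
gap = limit - finite_T
```

For this process the gap equals $(1/T)(\Gamma_1 + \Gamma_1')$ exactly, so it shrinks at rate 1/T, in line with the argument used in the proof.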


Consistent Estimation of T Times the Variance 
of the Sample Mean 


Hypothesis tests about the sample mean require an estimate of the matrix in
result (b) of Proposition 10.5. Let S represent this matrix:

$$S = \lim_{T\to\infty} T\cdot E[(\bar{y}_T - \mu)(\bar{y}_T - \mu)']. \qquad [10.5.7]$$

If the data were generated by a vector MA(q) process, then result (b) would
imply

$$S = \sum_{v=-q}^{q}\Gamma_v. \qquad [10.5.8]$$

A natural estimate then is

$$\hat{S} = \hat{\Gamma}_0 + \sum_{v=1}^{q}(\hat{\Gamma}_v + \hat{\Gamma}_v'), \qquad [10.5.9]$$

where

$$\hat{\Gamma}_v = (1/T)\sum_{t=v+1}^{T}(y_t - \bar{y})(y_{t-v} - \bar{y})'. \qquad [10.5.10]$$
tv 


As long as $y_t$ is ergodic for second moments, [10.5.9] gives a consistent
estimate of [10.5.8]. Indeed, Hansen (1982) and White (1984, Chapter 6) noted that
[10.5.9] gives a consistent estimate of the asymptotic variance of the sample mean
for a broad class of processes exhibiting time-dependent heteroskedasticity and
autocorrelation. To see why, note that for a process satisfying $E(y_t) = \mu$ with
time-varying second moments, the variance of the sample mean is given by

$$\begin{aligned}
E[(\bar{y}_T - \mu)(\bar{y}_T - \mu)']
&= E\Big\{\Big[(1/T)\sum_{t=1}^{T}(y_t - \mu)\Big]\Big[(1/T)\sum_{s=1}^{T}(y_s - \mu)\Big]'\Big\} \\
&= (1/T^2)\sum_{t=1}^{T}\sum_{s=1}^{T}E[(y_t - \mu)(y_s - \mu)'].
\end{aligned} \qquad [10.5.11]$$


Suppose, first, that $E[(y_t - \mu)(y_s - \mu)'] = 0$ for $|t - s| > q$, as was the case for
the vector MA(q) process, though we generalize from the MA(q) process to allow
$E[(y_t - \mu)(y_s - \mu)']$ to be a function of t for $|t - s| \le q$. Then [10.5.11] implies

$$\begin{aligned}
T\cdot E[(\bar{y}_T - \mu)(\bar{y}_T - \mu)']
&= (1/T)\sum_{t=1}^{T}E[(y_t - \mu)(y_t - \mu)'] \\
&\quad + (1/T)\sum_{t=2}^{T}\{E[(y_t - \mu)(y_{t-1} - \mu)'] + E[(y_{t-1} - \mu)(y_t - \mu)']\} \\
&\quad + (1/T)\sum_{t=3}^{T}\{E[(y_t - \mu)(y_{t-2} - \mu)'] + E[(y_{t-2} - \mu)(y_t - \mu)']\} + \cdots \\
&\quad + (1/T)\sum_{t=q+1}^{T}\{E[(y_t - \mu)(y_{t-q} - \mu)'] + E[(y_{t-q} - \mu)(y_t - \mu)']\}.
\end{aligned} \qquad [10.5.12]$$
The estimate [10.5.9] replaces

$$(1/T)\sum_{t=v+1}^{T}E[(y_t - \mu)(y_{t-v} - \mu)'] \qquad [10.5.13]$$

in [10.5.12] with

$$(1/T)\sum_{t=v+1}^{T}(y_t - \bar{y}_T)(y_{t-v} - \bar{y}_T)', \qquad [10.5.14]$$

and thus [10.5.9] provides a consistent estimate of the limit of [10.5.12] whenever
[10.5.14] converges in probability to [10.5.13]. Hence, the estimator proposed in
[10.5.9] can give a consistent estimate of T times the variance of the sample mean
in the presence of both heteroskedasticity and autocorrelation up through order q.

More generally, even if $E[(y_t - \mu)(y_s - \mu)']$ is nonzero for all t and s, as
long as this matrix goes to zero sufficiently quickly as $|t - s| \to \infty$, then there is
still a sense in which $\hat{S}_T$ in [10.5.9] can provide a consistent estimate of S.
Specifically, if, as the sample size T grows, a larger number of sample autocovariances
q is used to form the estimate, then $\hat{S}_T \xrightarrow{p} S$ (see White, 1984, p. 155).


The Newey-West Estimator 


Although [10.5.9] gives a consistent estimate of S, it has the drawback that
[10.5.9] need not be positive semidefinite in small samples. If $\hat{S}$ is not positive
semidefinite, then some linear combination of the elements of y is asserted to have
a negative variance, a considerable handicap in forming a hypothesis test!

Newey and West (1987) suggested the alternative estimate

$$\tilde{S} = \hat{\Gamma}_0 + \sum_{v=1}^{q}\Big[1 - \frac{v}{q+1}\Big](\hat{\Gamma}_v + \hat{\Gamma}_v'), \qquad [10.5.15]$$

where $\hat{\Gamma}_v$ is given by [10.5.10]. For example, for q = 2,

$$\tilde{S} = \hat{\Gamma}_0 + \tfrac{2}{3}(\hat{\Gamma}_1 + \hat{\Gamma}_1') + \tfrac{1}{3}(\hat{\Gamma}_2 + \hat{\Gamma}_2').$$

Newey and West showed that $\tilde{S}$ is positive semidefinite by construction and has
the same consistency properties that were noted for $\hat{S}$, namely, that if q and T
both go to infinity with $q/T^{1/4} \to 0$, then $\tilde{S}_T \xrightarrow{p} S$.
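The contrast between [10.5.9] and [10.5.15] can be seen in a very small example. The sketch below assumes NumPy; the function names and the alternating data series are illustrative, not from the text. It computes both estimates for a scalar series that flips sign every period, a case in which the unweighted sum [10.5.9] is negative while the Newey-West estimate is not.

```python
import numpy as np

def sample_autocov(y, v):
    """Gamma_hat_v of [10.5.10] for a (T x n) data array y."""
    T = y.shape[0]
    dev = y - y.mean(axis=0)
    return dev[v:].T @ dev[:T - v] / T

def unweighted_estimate(y, q):
    """Estimate [10.5.9]: all q autocovariances receive weight one."""
    S = sample_autocov(y, 0)
    for v in range(1, q + 1):
        G = sample_autocov(y, v)
        S = S + G + G.T
    return S

def newey_west(y, q):
    """Estimate [10.5.15]: Bartlett weights 1 - v/(q+1)."""
    S = sample_autocov(y, 0)
    for v in range(1, q + 1):
        G = sample_autocov(y, v)
        S = S + (1 - v / (q + 1)) * (G + G.T)
    return S

# Ten observations alternating between +1 and -1 (sample mean zero)
y_alt = np.array([1.0, -1.0] * 5).reshape(-1, 1)
s_hat = unweighted_estimate(y_alt, q=1)[0, 0]
s_tilde = newey_west(y_alt, q=1)[0, 0]
```

Here $\hat{\Gamma}_0 = 1$ and $\hat{\Gamma}_1 = -0.9$, so the unweighted estimate is $1 - 1.8 = -0.8$ while the Newey-West estimate is $1 - 0.9 = 0.1$.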


Application: Autocorrelation- and Heteroskedasticity-Consistent 
Standard Errors for Linear Regressions 


As an application of using the Newey-West weighting, consider the linear
regression model

$$y_t = x_t'\beta + u_t,$$

for $x_t$ a (k × 1) vector of explanatory variables. Recall from equation [8.2.6] that
the deviation of the OLS estimate $b_T$ from the true value $\beta$ satisfies

$$\sqrt{T}(b_T - \beta) = \Big[(1/T)\sum_{t=1}^{T}x_t x_t'\Big]^{-1}\Big[(1/\sqrt{T})\sum_{t=1}^{T}x_t u_t\Big]. \qquad [10.5.16]$$


In calculating the asymptotic distribution of the OLS estimate $b_T$, we usually assume
that the first term in [10.5.16] converges in probability to $Q^{-1}$:

$$\Big[(1/T)\sum_{t=1}^{T}x_t x_t'\Big]^{-1} \xrightarrow{p} Q^{-1}. \qquad [10.5.17]$$

The second term in [10.5.16] can be viewed as $\sqrt{T}$ times the sample mean of the
(k × 1) vector $x_t u_t$:

$$\Big[(1/\sqrt{T})\sum_{t=1}^{T}x_t u_t\Big] = \sqrt{T}\,(1/T)\sum_{t=1}^{T}y_t = \sqrt{T}\cdot\bar{y}_T,$$

where $y_t = x_t u_t$. Provided that $E(u_t|x_t) = 0$, the vector $y_t$ has mean zero. We can
allow for conditional heteroskedasticity, autocorrelation, and time variation in the
second moments of $y_t$, as long as

$$S = \lim_{T\to\infty} T\cdot E(\bar{y}_T\bar{y}_T') \qquad [10.5.18]$$

exists. Under general conditions,⁴ it then turns out that

$$\Big[(1/\sqrt{T})\sum_{t=1}^{T}x_t u_t\Big] = \sqrt{T}\cdot\bar{y}_T \xrightarrow{L} N(0, S).$$

Substituting this and [10.5.17] into [10.5.16],

$$\sqrt{T}(b_T - \beta) \xrightarrow{L} N(0, Q^{-1}SQ^{-1}). \qquad [10.5.19]$$


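In light of the foregoing discussion, we might hope to estimate S by

$$\hat{S}_T = \hat{\Gamma}_{0,T} + \sum_{v=1}^{q}\Big[1 - \frac{v}{q+1}\Big](\hat{\Gamma}_{v,T} + \hat{\Gamma}_{v,T}'). \qquad [10.5.20]$$

⁴See, for example, White (1984, p. 119).

Here,

$$\hat{\Gamma}_{v,T} = (1/T)\sum_{t=v+1}^{T}(x_t\hat{u}_{t,T}\hat{u}_{t-v,T}x_{t-v}'),$$

$\hat{u}_{t,T}$ is the OLS residual for date t in a sample of size T ($\hat{u}_{t,T} = y_t - x_t'b_T$), and q
is a lag length beyond which we are willing to assume that the correlation between
$x_t u_t$ and $x_{t-v}u_{t-v}$ is essentially zero. Clearly, Q is consistently estimated by
$\hat{Q}_T = (1/T)\sum_{t=1}^{T}x_t x_t'$. Substituting $\hat{Q}_T$ and $\hat{S}_T$ into [10.5.19], the suggestion is to
treat the OLS estimate $b_T$ as if

$$b_T \approx N(\beta, (\hat{V}_T/T)),$$

where

$$\begin{aligned}
\hat{V}_T &= \hat{Q}_T^{-1}\hat{S}_T\hat{Q}_T^{-1} \\
&= \Big[(1/T)\sum_{t=1}^{T}x_t x_t'\Big]^{-1}(1/T)\Big[\sum_{t=1}^{T}\hat{u}_t^2 x_t x_t' \\
&\qquad + \sum_{v=1}^{q}\Big[1 - \frac{v}{q+1}\Big]\sum_{t=v+1}^{T}(x_t\hat{u}_t\hat{u}_{t-v}x_{t-v}' + x_{t-v}\hat{u}_{t-v}\hat{u}_t x_t')\Big]\Big[(1/T)\sum_{t=1}^{T}x_t x_t'\Big]^{-1};
\end{aligned}$$

that is, the variance of $b_T$ is approximated by

$$\begin{aligned}
(\hat{V}_T/T) &= \Big[\sum_{t=1}^{T}x_t x_t'\Big]^{-1}\Big[\sum_{t=1}^{T}\hat{u}_t^2 x_t x_t' \\
&\qquad + \sum_{v=1}^{q}\Big[1 - \frac{v}{q+1}\Big]\sum_{t=v+1}^{T}(x_t\hat{u}_t\hat{u}_{t-v}x_{t-v}' + x_{t-v}\hat{u}_{t-v}\hat{u}_t x_t')\Big]\Big[\sum_{t=1}^{T}x_t x_t'\Big]^{-1},
\end{aligned} \qquad [10.5.21]$$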


where $\hat{u}_t$ is the OLS sample residual. The square root of the row i, column i element
of $\hat{V}_T/T$ is known as a heteroskedasticity- and autocorrelation-consistent standard
error for the ith element of the estimated OLS coefficient vector. The hope is that
standard errors based on [10.5.21] will be robust to a variety of forms of
heteroskedasticity and autocorrelation of the residuals $u_t$ of the regression.
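Formula [10.5.21] translates almost line for line into matrix code. The following is a minimal sketch, assuming NumPy and simulated regression data; the function name `hac_standard_errors` and the data-generating choices are hypothetical and only serve to exercise the formula.

```python
import numpy as np

def hac_standard_errors(X, y, q):
    """OLS coefficients and standard errors based on [10.5.21].

    X is a (T x k) array of regressors, y a length-T array, and q the lag
    length used in the Bartlett weights 1 - v/(q+1).
    """
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    u = y - X @ b                         # OLS residuals
    g = X * u[:, None]                    # row t holds x_t' * u_hat_t
    middle = g.T @ g                      # sum of u_hat_t^2 * x_t x_t'
    for v in range(1, q + 1):
        cross = g[v:].T @ g[:-v]          # sum of x_t u_t u_{t-v} x_{t-v}'
        middle += (1 - v / (q + 1)) * (cross + cross.T)
    V_over_T = XtX_inv @ middle @ XtX_inv
    return b, np.sqrt(np.diag(V_over_T))

rng = np.random.default_rng(0)
T = 200
X = np.column_stack([np.ones(T), rng.standard_normal(T)])
y = X @ np.array([1.0, 2.0]) + rng.standard_normal(T)

b, se = hac_standard_errors(X, y, q=4)
b0, se_white = hac_standard_errors(X, y, q=0)
```

With q = 0 only the first term in the middle matrix survives, so `se_white` reduces to heteroskedasticity-consistent standard errors with no autocorrelation correction. The choice q = 4 here is arbitrary.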


Spectral-Based Estimators 


A number of alternative estimates of S in [10.5.7] have been suggested in the
literature. Notice that as in the univariate case discussed in Section 7.2, if $y_t$ is
covariance-stationary, then S has the interpretation as the
autocovariance-generating function $G_Y(z) = \sum_{v=-\infty}^{\infty}\Gamma_v z^v$ evaluated at z = 1, or, equivalently, as $2\pi$
times the population spectrum at frequency zero:

$$S = \sum_{v=-\infty}^{\infty}\Gamma_v = 2\pi s_Y(0).$$

Indeed, the Newey-West estimator [10.5.15] is numerically identical to $2\pi$ times
the Bartlett estimate of the multivariate spectrum described in [10.4.42] evaluated
at frequency $\omega = 0$. Gallant (1987, p. 533) proposed a similar estimator based on
a Parzen kernel,

$$\hat{S} = \hat{\Gamma}_0 + \sum_{v=1}^{q}k[v/(q+1)](\hat{\Gamma}_v + \hat{\Gamma}_v'),$$

where

$$k(z) = \begin{cases}
1 - 6z^2 + 6z^3 & \text{for } 0 \le z \le \tfrac{1}{2} \\
2(1 - z)^3 & \text{for } \tfrac{1}{2} \le z \le 1 \\
0 & \text{otherwise.}
\end{cases}$$

For example, for q = 2, we have

$$\hat{S} = \hat{\Gamma}_0 + \tfrac{5}{9}(\hat{\Gamma}_1 + \hat{\Gamma}_1') + \tfrac{2}{27}(\hat{\Gamma}_2 + \hat{\Gamma}_2').$$
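Andrews (1991) examined a number of alternative estimators and found the
best results for a quadratic spectral kernel:

$$k(z) = \frac{3}{(6\pi z/5)^2}\left[\frac{\sin(6\pi z/5)}{6\pi z/5} - \cos(6\pi z/5)\right].$$

In contrast to the Newey-West and Gallant estimators, Andrews's suggestion makes
use of all T - 1 estimated autocovariances:

$$\hat{S} = \frac{T}{T-k}\left[\hat{\Gamma}_0 + \sum_{v=1}^{T-1}k\Big(\frac{v}{q+1}\Big)(\hat{\Gamma}_v + \hat{\Gamma}_v')\right]. \qquad [10.5.22]$$

Even though [10.5.22] makes use of all computed autocovariances, there is still a
bandwidth parameter q to be chosen for constructing the kernel. For example, for
q = 2,

$$\begin{aligned}
\hat{\Gamma}_0 + \sum_{v=1}^{T-1}k(v/3)(\hat{\Gamma}_v + \hat{\Gamma}_v') &= \hat{\Gamma}_0 + 0.85(\hat{\Gamma}_1 + \hat{\Gamma}_1') \\
&\quad + 0.50(\hat{\Gamma}_2 + \hat{\Gamma}_2') + 0.14(\hat{\Gamma}_3 + \hat{\Gamma}_3') + \cdots.
\end{aligned}$$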


Andrews recommended multiplying the estimate by T/(T - k), where $y_t = x_t\hat{u}_t$
for $\hat{u}_t$ the sample OLS residual from a regression with k explanatory variables.
Andrews (1991) and Newey and West (1992) also offered some guidance for
choosing an optimal value of the lag truncation or bandwidth parameter q for each of
the estimators of S that have been discussed here.
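The kernel weights quoted in the q = 2 examples can be reproduced directly from the two kernel formulas. The sketch below assumes NumPy; the function names `parzen` and `quadratic_spectral` are illustrative, and the value returned at z = 0 for the quadratic spectral kernel is its limiting value, added here only to avoid a division by zero.

```python
import numpy as np

def parzen(z):
    """Parzen kernel as defined above, for z >= 0."""
    if 0 <= z <= 0.5:
        return 1 - 6 * z**2 + 6 * z**3
    if 0.5 < z <= 1:
        return 2 * (1 - z)**3
    return 0.0

def quadratic_spectral(z):
    """Quadratic spectral kernel; k(0) is set to its limit of 1."""
    if z == 0:
        return 1.0
    x = 6 * np.pi * z / 5
    return 3 / x**2 * (np.sin(x) / x - np.cos(x))

q = 2
parzen_weights = [parzen(v / (q + 1)) for v in (1, 2, 3)]
qs_weights = [quadratic_spectral(v / (q + 1)) for v in (1, 2, 3)]
```

For q = 2 the Parzen weights on the first two autocovariances are 5/9 and 2/27, and the weight at v = 3 is zero, whereas the quadratic spectral weights are roughly 0.85, 0.50, and 0.14 and remain nonzero at higher lags.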

The estimators that have been described will work best when $y_t$ has a finite
moving average representation. Andrews and Monahan (1992) suggested an
alternative approach to estimating S that also takes advantage of any autoregressive
structure to the errors. Let $y_t$ be a zero-mean vector, and let S be the asymptotic
variance of the sample mean of y. For example, if we want to calculate
heteroskedasticity- and autocorrelation-consistent standard errors for OLS estimation, $y_t$
would correspond to $x_t\hat{u}_t$, where $x_t$ is the vector of explanatory variables for the
regression and $\hat{u}_t$ is the OLS residual. The first step in estimating S is to fit a
low-order VAR for $y_t$,

$$y_t = \Phi_1 y_{t-1} + \Phi_2 y_{t-2} + \cdots + \Phi_p y_{t-p} + v_t, \qquad [10.5.23]$$

where $v_t$ is presumed to have some residual autocorrelation not entirely captured
by the VAR. Note that since $y_t$ has mean zero, no constant term is included in
[10.5.23]. The ith row represented in [10.5.23] can be estimated by an OLS
regression of the ith element of $y_t$ on p lags of all the elements of y, though if any
eigenvalue of $|I_n\lambda^p - \hat{\Phi}_1\lambda^{p-1} - \hat{\Phi}_2\lambda^{p-2} - \cdots - \hat{\Phi}_p| = 0$ is too close to the unit
circle (say, greater than 0.97 in modulus), Andrews and Monahan (1992, p. 957)
recommended altering the OLS estimates so as to reduce the largest eigenvalue.

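The second step in the Andrews and Monahan procedure is to calculate an
estimate $\hat{S}^*$ using one of the methods described previously based on the fitted
residuals $\hat{v}_t$ from [10.5.23]. For example,

$$\hat{S}_T^* = \hat{\Gamma}_0^* + \sum_{v=1}^{q}\Big[1 - \frac{v}{q+1}\Big](\hat{\Gamma}_v^* + \hat{\Gamma}_v^{*\prime}), \qquad [10.5.24]$$

where $\hat{\Gamma}_v^*$ is the vth sample autocovariance matrix of the fitted residuals $\hat{v}_t$
and where q is a parameter representing the maximal order of autocorrelation
assumed for $v_t$. The matrix $\hat{S}_T^*$ will be recognized as an estimate of $2\pi\cdot s_V(0)$, where
$s_V(\omega)$ is the spectral density of v:

$$s_V(\omega) = (2\pi)^{-1}\sum_{v=-\infty}^{\infty}\{E(v_t v_{t-v}')\}e^{-i\omega v}.$$

Notice that the original series $y_t$ can be obtained from $v_t$ by applying the following
filter:

$$y_t = [I_n - \Phi_1 L - \Phi_2 L^2 - \cdots - \Phi_p L^p]^{-1}v_t.$$

Thus, from [10.4.43], the spectral density of y is related to the spectral density of
v according to

$$\begin{aligned}
s_Y(\omega) = &\{[I_n - \Phi_1 e^{-i\omega} - \Phi_2 e^{-2i\omega} - \cdots - \Phi_p e^{-pi\omega}]\}^{-1}s_V(\omega) \\
&\times \{[I_n - \Phi_1 e^{i\omega} - \Phi_2 e^{2i\omega} - \cdots - \Phi_p e^{pi\omega}]'\}^{-1}.
\end{aligned}$$

Hence, an estimate of $2\pi$ times the spectral density of y at frequency zero is given
by

$$\hat{S}_T = \{[I_n - \hat{\Phi}_1 - \hat{\Phi}_2 - \cdots - \hat{\Phi}_p]\}^{-1}\hat{S}_T^*\{[I_n - \hat{\Phi}_1 - \hat{\Phi}_2 - \cdots - \hat{\Phi}_p]'\}^{-1}, \qquad [10.5.25]$$

where $\hat{S}_T^*$ is calculated from [10.5.24]. The matrix $\hat{S}_T$ in [10.5.25] is the
Andrews-Monahan (1992) estimate of S, where

$$S = \lim_{T\to\infty} T\cdot E(\bar{y}_T\bar{y}_T').$$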
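The two steps can be sketched in a few lines for the simplest case. The code below assumes NumPy and makes several simplifying assumptions that are not part of the procedure as described: the prewhitening VAR has order p = 1, no eigenvalue adjustment is applied, the residual autocovariances are computed without subtracting a sample mean and are divided by the number of residuals, and the data are a simulated scalar AR(1). The function names are hypothetical.

```python
import numpy as np

def bartlett_long_run(v, q):
    """Bartlett-weighted sum as in [10.5.24] for residuals v (T x n)."""
    T = v.shape[0]
    S = v.T @ v / T
    for k in range(1, q + 1):
        G = v[k:].T @ v[:-k] / T
        S = S + (1 - k / (q + 1)) * (G + G.T)
    return S

def andrews_monahan_var1(y, q):
    """Prewhiten y (T x n, mean zero) with an OLS VAR(1), then apply [10.5.25]."""
    Y, Ylag = y[1:], y[:-1]
    # lstsq solves Ylag @ B = Y, so the VAR coefficient matrix is B'
    Phi = np.linalg.lstsq(Ylag, Y, rcond=None)[0].T
    resid = Y - Ylag @ Phi.T
    S_star = bartlett_long_run(resid, q)
    D = np.linalg.inv(np.eye(y.shape[1]) - Phi)
    return D @ S_star @ D.T

# Simulated scalar AR(1) with coefficient 0.5 and unit innovation variance,
# for which S = 1 / (1 - 0.5)^2 = 4
rng = np.random.default_rng(42)
T = 5000
e = rng.standard_normal(T)
y_sim = np.zeros(T)
for t in range(1, T):
    y_sim[t] = 0.5 * y_sim[t - 1] + e[t]
S_am = andrews_monahan_var1(y_sim.reshape(-1, 1), q=2)
```

Because the VAR(1) absorbs the autoregressive dependence here, a small q suffices for the residuals, and `S_am` should land near the population value of 4.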


APPENDIX 10.A. Proofs of Chapter 10 Propositions 


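■ Proof of Proposition 10.1. The eigenvalues of F are the values of $\lambda$ for which the following
determinant is zero:

$$\begin{vmatrix}
(\Phi_1 - \lambda I_n) & \Phi_2 & \Phi_3 & \cdots & \Phi_{p-1} & \Phi_p \\
I_n & -\lambda I_n & 0 & \cdots & 0 & 0 \\
0 & I_n & -\lambda I_n & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \cdots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & I_n & -\lambda I_n
\end{vmatrix}. \qquad [10.A.1]$$

Multiply each of the final block of n columns by $(1/\lambda)$ and add to the previous block. Multiply
each of the n columns of this resulting next-to-final block by $(1/\lambda)$ and add the result to the
third-to-last block of columns. Proceeding in this manner reveals [10.A.1] to be the same
as

$$\begin{vmatrix} X_1 & X_2 \\ 0 & -\lambda I_{n(p-1)} \end{vmatrix}, \qquad [10.A.2]$$

where $X_1$ denotes the following (n × n) matrix:

$$X_1 = (\Phi_1 - \lambda I_n) + (\Phi_2/\lambda) + (\Phi_3/\lambda^2) + \cdots + (\Phi_p/\lambda^{p-1})$$

and $X_2$ is a related [n × n(p - 1)] matrix. Let S denote the following (np × np) matrix:

$$S = \begin{bmatrix} 0 & I_{n(p-1)} \\ I_n & 0 \end{bmatrix},$$

and note that its inverse is given by

$$S^{-1} = \begin{bmatrix} 0 & I_n \\ I_{n(p-1)} & 0 \end{bmatrix},$$

as may be verified by direct multiplication. Premultiplying a matrix by S and postmultiplying
by $S^{-1}$ will not change the determinant. Thus, [10.A.2] is equal to

$$\begin{vmatrix}
\begin{bmatrix} 0 & I_{n(p-1)} \\ I_n & 0 \end{bmatrix}
\begin{bmatrix} X_1 & X_2 \\ 0 & -\lambda I_{n(p-1)} \end{bmatrix}
\begin{bmatrix} 0 & I_n \\ I_{n(p-1)} & 0 \end{bmatrix}
\end{vmatrix}
= \begin{vmatrix} -\lambda I_{n(p-1)} & 0 \\ X_2 & X_1 \end{vmatrix}. \qquad [10.A.3]$$

Applying the formula for calculating a determinant [A.4.5] recursively, [10.A.3] is equal to

$$\begin{aligned}
(-\lambda)^{n(p-1)}|X_1| &= (-\lambda)^{n(p-1)}|\Phi_1 - \lambda I_n + (\Phi_2/\lambda) + (\Phi_3/\lambda^2) + \cdots + (\Phi_p/\lambda^{p-1})| \\
&= (-1)^{np}|I_n\lambda^p - \Phi_1\lambda^{p-1} - \Phi_2\lambda^{p-2} - \cdots - \Phi_p|.
\end{aligned}$$

Setting this to zero produces equation [10.1.13]. ■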
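The conclusion of this proof is easy to confirm numerically. The sketch below assumes NumPy and uses made-up coefficient matrices for a bivariate VAR(2); the function names `companion` and `char_det` are hypothetical. It builds F, evaluates $|F - \lambda I_{np}|$ at an arbitrary $\lambda$, and compares it with $(-1)^{np}|I_n\lambda^p - \Phi_1\lambda^{p-1} - \cdots - \Phi_p|$.

```python
import numpy as np

def companion(phis):
    """Stack (n x n) matrices Phi_1, ..., Phi_p into the (np x np) matrix F."""
    p = len(phis)
    n = phis[0].shape[0]
    F = np.zeros((n * p, n * p))
    F[:n, :] = np.hstack(phis)
    F[n:, :-n] = np.eye(n * (p - 1))
    return F

def char_det(phis, lam):
    """|I_n lam^p - Phi_1 lam^(p-1) - ... - Phi_p| for a scalar lam."""
    p = len(phis)
    n = phis[0].shape[0]
    M = np.eye(n, dtype=complex) * lam**p
    for j, phi in enumerate(phis, start=1):
        M -= phi * lam**(p - j)
    return np.linalg.det(M)

phis = [np.array([[0.5, 0.1], [0.2, 0.3]]),
        np.array([[0.1, 0.0], [0.05, 0.2]])]
F = companion(phis)
n_p = F.shape[0]

lam = 0.3 + 0.2j
lhs = np.linalg.det(F - lam * np.eye(n_p))
rhs = (-1) ** n_p * char_det(phis, lam)

eigenvalues = np.linalg.eigvals(F)
residuals = [abs(char_det(phis, ev)) for ev in eigenvalues]
```

The quantities `lhs` and `rhs` agree, and each entry of `residuals` is zero up to rounding error, so the eigenvalues of F are exactly the roots of the determinantal equation.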


■ Proof of Proposition 10.2. It is helpful to define $z_t(i, l)$ to be the component of $y_{it}$ that
reflects the cumulative effects of the lth element of $\varepsilon$:

$$z_t(i, l) = \psi_{il}^{(0)}\varepsilon_{lt} + \psi_{il}^{(1)}\varepsilon_{l,t-1} + \psi_{il}^{(2)}\varepsilon_{l,t-2} + \cdots = \sum_{v=0}^{\infty}\psi_{il}^{(v)}\varepsilon_{l,t-v}, \qquad [10.A.4]$$

where $\psi_{il}^{(v)}$ denotes the row i, column l element of the matrix $\Psi_v$. The actual value of the
ith variable $y_{it}$ is the sum of the contributions of each of the l = 1, 2, ..., n components
of $\varepsilon$:

$$y_{it} = \mu_i + \sum_{l=1}^{n}z_t(i, l). \qquad [10.A.5]$$

The results of Proposition 10.2 are all established by first demonstrating absolute summability
of the moments of $z_t(i, l)$ and then observing that the moments of $y_t$ are obtained from
finite sums of these expressions based on $z_t(i, l)$.


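Proof of (a). Consider the random variable $z_t(i, l)\cdot z_{t-s}(j, m)$, where i, l, j, and m represent
arbitrary indices between 1 and n and where s is the order of the autocovariance of y that
is being calculated. Note from [10.A.4] that

$$\begin{aligned}
E\{z_t(i, l)\cdot z_{t-s}(j, m)\}
&= E\Big\{\Big[\sum_{r=0}^{\infty}\psi_{il}^{(r)}\varepsilon_{l,t-r}\Big]\times\Big[\sum_{v=0}^{\infty}\psi_{jm}^{(v)}\varepsilon_{m,t-s-v}\Big]\Big\} \\
&= \sum_{r=0}^{\infty}\sum_{v=0}^{\infty}\{\psi_{il}^{(r)}\psi_{jm}^{(v)}\}\cdot E\{\varepsilon_{l,t-r}\varepsilon_{m,t-s-v}\}.
\end{aligned} \qquad [10.A.6]$$

The expectation operator can be moved inside the summation here because

$$\sum_{r=0}^{\infty}\sum_{v=0}^{\infty}|\psi_{il}^{(r)}\psi_{jm}^{(v)}| = \sum_{r=0}^{\infty}\sum_{v=0}^{\infty}|\psi_{il}^{(r)}|\cdot|\psi_{jm}^{(v)}| = \Big\{\sum_{r=0}^{\infty}|\psi_{il}^{(r)}|\Big\}\times\Big\{\sum_{v=0}^{\infty}|\psi_{jm}^{(v)}|\Big\} < \infty.$$

Now, the product of $\varepsilon$'s in the final term in [10.A.6] can have nonzero expectation only if
the $\varepsilon$'s have the same date, that is, if r = s + v. Thus, although [10.A.6] involves a sum
over an infinite number of values of r, only the value at r = s + v contributes to this sum:

$$E\{z_t(i, l)\cdot z_{t-s}(j, m)\} = \sum_{v=0}^{\infty}\{\psi_{il}^{(s+v)}\psi_{jm}^{(v)}\}\cdot E\{\varepsilon_{l,t-s-v}\varepsilon_{m,t-s-v}\} = \sum_{v=0}^{\infty}\psi_{il}^{(s+v)}\psi_{jm}^{(v)}\sigma_{lm}, \qquad [10.A.7]$$

where $\sigma_{lm}$ represents the covariance between $\varepsilon_{lt}$ and $\varepsilon_{mt}$ and is given by the row l, column
m element of $\Omega$.

The row i, column j element of $\Gamma_s$ gives the value of

$$\gamma_{ij}^{(s)} = E(y_{it} - \mu_i)(y_{j,t-s} - \mu_j).$$

Using [10.A.5] and [10.A.7], this can be expressed as

$$\begin{aligned}
E(y_{it} - \mu_i)(y_{j,t-s} - \mu_j)
&= E\Big\{\Big[\sum_{l=1}^{n}z_t(i, l)\Big]\Big[\sum_{m=1}^{n}z_{t-s}(j, m)\Big]\Big\} \\
&= \sum_{l=1}^{n}\sum_{m=1}^{n}E\{z_t(i, l)\cdot z_{t-s}(j, m)\} \\
&= \sum_{l=1}^{n}\sum_{m=1}^{n}\sum_{v=0}^{\infty}\psi_{il}^{(s+v)}\psi_{jm}^{(v)}\sigma_{lm} \\
&= \sum_{v=0}^{\infty}\sum_{l=1}^{n}\sum_{m=1}^{n}\psi_{il}^{(s+v)}\psi_{jm}^{(v)}\sigma_{lm}.
\end{aligned} \qquad [10.A.8]$$

But $\sum_{l=1}^{n}\sum_{m=1}^{n}\psi_{il}^{(s+v)}\psi_{jm}^{(v)}\sigma_{lm}$ is the row i, column j element of $\Psi_{v+s}\Omega\Psi_v'$. Thus, [10.A.8]
states that the row i, column j element of $\Gamma_s$ is given by the row i, column j element of
$\sum_{v=0}^{\infty}\Psi_{v+s}\Omega\Psi_v'$, as asserted in part (a).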


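Proof of (b). Define $h_s(\cdot)$ to be the moment in [10.A.7]:

$$h_s(i, j, l, m) = E\{z_t(i, l)\cdot z_{t-s}(j, m)\} = \sum_{v=0}^{\infty}\psi_{il}^{(s+v)}\psi_{jm}^{(v)}\sigma_{lm},$$

and notice that the sequence $\{h_s(\cdot)\}_{s=0}^{\infty}$ is absolutely summable:

$$\begin{aligned}
\sum_{s=0}^{\infty}|h_s(i, j, l, m)| &\le \sum_{s=0}^{\infty}\sum_{v=0}^{\infty}|\psi_{il}^{(s+v)}|\cdot|\psi_{jm}^{(v)}|\cdot|\sigma_{lm}| \\
&= |\sigma_{lm}|\sum_{v=0}^{\infty}|\psi_{jm}^{(v)}|\sum_{s=0}^{\infty}|\psi_{il}^{(s+v)}| \\
&\le |\sigma_{lm}|\sum_{v=0}^{\infty}|\psi_{jm}^{(v)}|\sum_{s=0}^{\infty}|\psi_{il}^{(s)}| \\
&< \infty.
\end{aligned} \qquad [10.A.9]$$

Furthermore, the row i, column j element of $\Gamma_s$ was seen in [10.A.8] to be given by

$$\gamma_{ij}^{(s)} = \sum_{l=1}^{n}\sum_{m=1}^{n}h_s(i, j, l, m).$$

Hence,

$$\sum_{s=0}^{\infty}|\gamma_{ij}^{(s)}| \le \sum_{s=0}^{\infty}\sum_{l=1}^{n}\sum_{m=1}^{n}|h_s(i, j, l, m)| = \sum_{l=1}^{n}\sum_{m=1}^{n}\sum_{s=0}^{\infty}|h_s(i, j, l, m)|. \qquad [10.A.10]$$

From [10.A.9], there exists an $M < \infty$ such that

$$\sum_{s=0}^{\infty}|h_s(i, j, l, m)| < M$$

for any value of i, j, l, or m. Hence, [10.A.10] implies

$$\sum_{s=0}^{\infty}|\gamma_{ij}^{(s)}| < \sum_{l=1}^{n}\sum_{m=1}^{n}M = n^2 M < \infty,$$

confirming that the row i, column j element of $\{\Gamma_s\}_{s=0}^{\infty}$ is absolutely summable, as claimed
by part (b).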




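Proof of (c). Essentially the identical algebra as in the proof of Proposition 7.10 establishes
that

$$\begin{aligned}
&E|z_{t_1}(i_1, l_1)\cdot z_{t_2}(i_2, l_2)\cdot z_{t_3}(i_3, l_3)\cdot z_{t_4}(i_4, l_4)| \\
&\quad = E\Big|\Big\{\sum_{v_1=0}^{\infty}\psi_{i_1 l_1}^{(v_1)}\varepsilon_{l_1,t_1-v_1}\Big\}\cdot\Big\{\sum_{v_2=0}^{\infty}\psi_{i_2 l_2}^{(v_2)}\varepsilon_{l_2,t_2-v_2}\Big\} \\
&\qquad\quad \cdot\Big\{\sum_{v_3=0}^{\infty}\psi_{i_3 l_3}^{(v_3)}\varepsilon_{l_3,t_3-v_3}\Big\}\cdot\Big\{\sum_{v_4=0}^{\infty}\psi_{i_4 l_4}^{(v_4)}\varepsilon_{l_4,t_4-v_4}\Big\}\Big| \\
&\quad \le \sum_{v_1=0}^{\infty}\sum_{v_2=0}^{\infty}\sum_{v_3=0}^{\infty}\sum_{v_4=0}^{\infty}|\psi_{i_1 l_1}^{(v_1)}\psi_{i_2 l_2}^{(v_2)}\psi_{i_3 l_3}^{(v_3)}\psi_{i_4 l_4}^{(v_4)}| \\
&\qquad\quad \times E|\varepsilon_{l_1,t_1-v_1}\varepsilon_{l_2,t_2-v_2}\varepsilon_{l_3,t_3-v_3}\varepsilon_{l_4,t_4-v_4}| \\
&\quad < \infty.
\end{aligned} \qquad [10.A.11]$$

Now,

$$\begin{aligned}
E|y_{i_1,t_1}y_{i_2,t_2}y_{i_3,t_3}y_{i_4,t_4}|
&= E\Big|\mu_{i_1} + \sum_{l_1=1}^{n}z_{t_1}(i_1, l_1)\Big|\cdot\Big|\mu_{i_2} + \sum_{l_2=1}^{n}z_{t_2}(i_2, l_2)\Big| \\
&\qquad \cdot\Big|\mu_{i_3} + \sum_{l_3=1}^{n}z_{t_3}(i_3, l_3)\Big|\cdot\Big|\mu_{i_4} + \sum_{l_4=1}^{n}z_{t_4}(i_4, l_4)\Big| \\
&\le E\Big\{|\mu_{i_1}| + \sum_{l_1=1}^{n}|z_{t_1}(i_1, l_1)|\Big\}\cdot\Big\{|\mu_{i_2}| + \sum_{l_2=1}^{n}|z_{t_2}(i_2, l_2)|\Big\} \\
&\qquad \cdot\Big\{|\mu_{i_3}| + \sum_{l_3=1}^{n}|z_{t_3}(i_3, l_3)|\Big\}\cdot\Big\{|\mu_{i_4}| + \sum_{l_4=1}^{n}|z_{t_4}(i_4, l_4)|\Big\}.
\end{aligned}$$

But this is a finite sum involving terms of the form of [10.A.11], which were seen to be
finite, along with terms involving first through third moments of z, which must also be
finite.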


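Proof of (d). Notice that

$$z_t(i, l)\cdot z_{t-s}(j, m) = \sum_{r=0}^{\infty}\sum_{v=0}^{\infty}\psi_{il}^{(r)}\psi_{jm}^{(v)}\varepsilon_{l,t-r}\varepsilon_{m,t-s-v}.$$

The same argument leading to [7.2.14] can be used to establish that

$$(1/T)\sum_{t=1}^{T}z_t(i, l)\cdot z_{t-s}(j, m) \xrightarrow{p} E\{z_t(i, l)\cdot z_{t-s}(j, m)\}. \qquad [10.A.12]$$

To see that [10.A.12] implies ergodicity for the second moments of y, notice from
[10.A.5] that

$$\begin{aligned}
(1/T)\sum_{t=1}^{T}y_{it}y_{j,t-s}
&= (1/T)\sum_{t=1}^{T}\Big[\mu_i + \sum_{l=1}^{n}z_t(i, l)\Big]\Big[\mu_j + \sum_{m=1}^{n}z_{t-s}(j, m)\Big] \\
&= \mu_i\mu_j + \mu_i\sum_{m=1}^{n}\Big[(1/T)\sum_{t=1}^{T}z_{t-s}(j, m)\Big] + \mu_j\sum_{l=1}^{n}\Big[(1/T)\sum_{t=1}^{T}z_t(i, l)\Big] \\
&\qquad + \sum_{l=1}^{n}\sum_{m=1}^{n}\Big[(1/T)\sum_{t=1}^{T}z_t(i, l)z_{t-s}(j, m)\Big] \\
&\xrightarrow{p} \mu_i\mu_j + \mu_i\sum_{m=1}^{n}E[z_{t-s}(j, m)] + \mu_j\sum_{l=1}^{n}E[z_t(i, l)] \\
&\qquad + \sum_{l=1}^{n}\sum_{m=1}^{n}E[z_t(i, l)z_{t-s}(j, m)] \\
&= E\Big\{\Big[\mu_i + \sum_{l=1}^{n}z_t(i, l)\Big]\Big[\mu_j + \sum_{m=1}^{n}z_{t-s}(j, m)\Big]\Big\} \\
&= E[y_{it}y_{j,t-s}],
\end{aligned}$$

as claimed. ■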




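■ Proof of Proposition 10.3. Writing out [10.2.11] explicitly,

$$\begin{aligned}
H(L)\Psi(L) = &(\cdots + H_{-1}L^{-1} + H_0L^0 + H_1L^1 + \cdots) \\
&\times (\Psi_0L^0 + \Psi_1L^1 + \Psi_2L^2 + \cdots),
\end{aligned}$$

from which the coefficient on $L^k$ is

$$B_k = H_k\Psi_0 + H_{k-1}\Psi_1 + H_{k-2}\Psi_2 + \cdots. \qquad [10.A.13]$$

Let $b_{ij}^{(k)}$ denote the row i, column j element of $B_k$, and let $h_{ij}^{(k)}$ and $\psi_{ij}^{(k)}$ denote the row i,
column j elements of $H_k$ and $\Psi_k$, respectively. Then the row i, column j element of the
matrix equation [10.A.13] states that

$$b_{ij}^{(k)} = \sum_{m=1}^{n}h_{im}^{(k)}\psi_{mj}^{(0)} + \sum_{m=1}^{n}h_{im}^{(k-1)}\psi_{mj}^{(1)} + \sum_{m=1}^{n}h_{im}^{(k-2)}\psi_{mj}^{(2)} + \cdots = \sum_{v=0}^{\infty}\sum_{m=1}^{n}h_{im}^{(k-v)}\psi_{mj}^{(v)}.$$

Thus,

$$\begin{aligned}
\sum_{k=-\infty}^{\infty}|b_{ij}^{(k)}| &= \sum_{k=-\infty}^{\infty}\Big|\sum_{v=0}^{\infty}\sum_{m=1}^{n}h_{im}^{(k-v)}\psi_{mj}^{(v)}\Big| \\
&\le \sum_{k=-\infty}^{\infty}\sum_{v=0}^{\infty}\sum_{m=1}^{n}|h_{im}^{(k-v)}\psi_{mj}^{(v)}| \\
&= \sum_{m=1}^{n}\sum_{v=0}^{\infty}|\psi_{mj}^{(v)}|\sum_{k=-\infty}^{\infty}|h_{im}^{(k-v)}|.
\end{aligned} \qquad [10.A.14]$$

But since $\{H_k\}_{k=-\infty}^{\infty}$ and $\{\Psi_k\}_{k=0}^{\infty}$ are absolutely summable,

$$\sum_{k=-\infty}^{\infty}|h_{im}^{(k-v)}| < M_1 < \infty$$

$$\sum_{v=0}^{\infty}|\psi_{mj}^{(v)}| < M_2 < \infty.$$

Thus, [10.A.14] becomes

$$\sum_{k=-\infty}^{\infty}|b_{ij}^{(k)}| < \sum_{m=1}^{n}M_1M_2 < \infty. \quad ■$$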


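■ Proof of Proposition 10.4. Let A be (m × n), B be (n × r), and C be (r × q). Let the
(n × 1) vector $b_i$ denote the ith column of B, and let $c_{ij}$ denote the row i, column j element
of C. Then

$$\begin{aligned}
ABC &= A[b_1 \;\; b_2 \;\; \cdots \;\; b_r]\begin{bmatrix}
c_{11} & c_{12} & \cdots & c_{1q} \\
c_{21} & c_{22} & \cdots & c_{2q} \\
\vdots & \vdots & \cdots & \vdots \\
c_{r1} & c_{r2} & \cdots & c_{rq}
\end{bmatrix} \\
&= [\{Ab_1c_{11} + Ab_2c_{21} + \cdots + Ab_rc_{r1}\} \\
&\qquad \{Ab_1c_{12} + Ab_2c_{22} + \cdots + Ab_rc_{r2}\} \;\cdots \\
&\qquad \{Ab_1c_{1q} + Ab_2c_{2q} + \cdots + Ab_rc_{rq}\}] \\
&= [\{c_{11}Ab_1 + c_{21}Ab_2 + \cdots + c_{r1}Ab_r\} \\
&\qquad \{c_{12}Ab_1 + c_{22}Ab_2 + \cdots + c_{r2}Ab_r\} \;\cdots \\
&\qquad \{c_{1q}Ab_1 + c_{2q}Ab_2 + \cdots + c_{rq}Ab_r\}].
\end{aligned}$$

Applying the vec operator gives

$$\begin{aligned}
\operatorname{vec}(ABC) &= \begin{bmatrix}
c_{11}Ab_1 + c_{21}Ab_2 + \cdots + c_{r1}Ab_r \\
c_{12}Ab_1 + c_{22}Ab_2 + \cdots + c_{r2}Ab_r \\
\vdots \\
c_{1q}Ab_1 + c_{2q}Ab_2 + \cdots + c_{rq}Ab_r
\end{bmatrix} \\
&= \begin{bmatrix}
c_{11}A & c_{21}A & \cdots & c_{r1}A \\
c_{12}A & c_{22}A & \cdots & c_{r2}A \\
\vdots & \vdots & \cdots & \vdots \\
c_{1q}A & c_{2q}A & \cdots & c_{rq}A
\end{bmatrix}\begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_r \end{bmatrix} \\
&= (C' \otimes A)\cdot\operatorname{vec}(B). \quad ■
\end{aligned}$$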
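The identity just proved can be spot-checked on random matrices. The sketch below assumes NumPy; the helper `vec` and the matrix dimensions are illustrative. It relies on `np.kron` for the Kronecker product and on column-major reshaping to stack columns.

```python
import numpy as np

def vec(M):
    """Stack the columns of M into a single vector."""
    return M.reshape(-1, order="F")

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 3))   # (m x n)
B = rng.standard_normal((3, 4))   # (n x r)
C = rng.standard_normal((4, 5))   # (r x q)

lhs = vec(A @ B @ C)
rhs = np.kron(C.T, A) @ vec(B)
```

The two vectors `lhs` and `rhs` have length mq = 10 and agree to rounding error.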




Chapter 10 Exercises 


10.1. Consider a scalar AR(p) process (n = 1). Deduce from equation [10.2.19] that the
(p × 1) vector consisting of the variance and first (p - 1) autocovariances,

$$\begin{bmatrix} \gamma_0 \\ \gamma_1 \\ \vdots \\ \gamma_{p-1} \end{bmatrix},$$

can be calculated from the first p elements in the first column of the ($p^2 \times p^2$) matrix
$\sigma^2[I_{p^2} - (F \otimes F)]^{-1}$ for F the (p × p) matrix defined in equation [1.2.3] in Chapter 1.

10.2. Let $y_t = (X_t, Y_t)'$ be given by

$$X_t = \varepsilon_t + \delta\varepsilon_{t-1}$$
$$Y_t = h_1X_{t-1} + u_t,$$

where $(\varepsilon_t, u_t)'$ is vector white noise with contemporaneous variance-covariance matrix given
by

$$\begin{bmatrix} E(\varepsilon_t^2) & E(\varepsilon_tu_t) \\ E(u_t\varepsilon_t) & E(u_t^2) \end{bmatrix} = \begin{bmatrix} \sigma_\varepsilon^2 & 0 \\ 0 & \sigma_u^2 \end{bmatrix}.$$

(a) Calculate the autocovariance matrices $\{\Gamma_k\}_{k=-\infty}^{\infty}$ for this process.
(b) Use equation [10.4.3] to calculate the population spectrum. Find the cospectrum
between X and Y and the quadrature spectrum from X to Y.
(c) Verify that your answer to part (b) could equivalently be calculated from
expression [10.4.45].
(d) Verify by integrating your answer to part (b) that [10.4.49] holds; that is, show
that

$$(2\pi)^{-1}\int_{-\pi}^{\pi}\frac{s_{YX}(\omega)}{s_{XX}(\omega)}e^{i\omega k}\,d\omega = \begin{cases} h_1 & \text{for } k = 1 \\ 0 & \text{for other integer } k. \end{cases}$$

Chapter 10 References 


Andrews, Donald W. K. 1991. "Heteroskedasticity and Autocorrelation Consistent
Covariance Matrix Estimation." Econometrica 59:817-58.

Andrews, Donald W. K., and J. Christopher Monahan. 1992. "An Improved Heteroskedasticity and
Autocorrelation Consistent Covariance Matrix Estimator." Econometrica 60:953-66.

Fuller, Wayne A. 1976. Introduction to Statistical Time Series. New York: Wiley.

Gallant, A. Ronald. 1987. Nonlinear Statistical Models. New York: Wiley.

Hansen, Lars P. 1982. "Large Sample Properties of Generalized Method of Moments
Estimators." Econometrica 50:1029-54.

Newey, Whitney K., and Kenneth D. West. 1987. "A Simple Positive Semi-Definite,
Heteroskedasticity and Autocorrelation Consistent Covariance Matrix." Econometrica 55:
703-8.

Newey, Whitney K., and Kenneth D. West. 1992. "Automatic Lag Selection in Covariance Matrix Estimation."
University of Wisconsin, Madison, Mimeo.

Sims, Christopher A. 1980. "Macroeconomics and Reality." Econometrica 48:1-48.

White, Halbert. 1984. Asymptotic Theory for Econometricians. Orlando, Fla.: Academic
Press.




11 


Vector Autoregressions 


The previous chapter introduced some basic tools for describing vector time series 
processes. This chapter looks in greater depth at vector autoregressions, which are 
particularly convenient for estimation and forecasting. Their popularity for ana- 
lyzing the dynamics of economic systems is due to Sims’s (1980) influential work. 
The chapter begins with a discussion of maximum likelihood estimation and hy- 
pothesis testing. Section 11.2 examines a concept of causation in bivariate systems 
proposed by Granger (1969). Section 11.3 generalizes the discussion of Granger 
causality to multivariate systems and examines estimation of restricted vector au- 
toregressions. Sections 11.4 and 11.5 introduce impulse-response functions and 
variance decompositions, which are used to summarize the dynamic relations be- 
tween the variables in a vector autoregression. Section 11.6 reviews how such 
summaries can be used to evaluate structural hypotheses. Section 11.7 develops 
formulas needed to calculate standard errors for impulse-response functions. 


11.1. Maximum Likelihood Estimation and Hypothesis
Testing for an Unrestricted Vector Autoregression


The Conditional Likelihood Function 
for a Vector Autoregression 


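Let $\mathbf{y}_t$ denote an (n × 1) vector containing the values that n variables assume at date t. The dynamics of $\mathbf{y}_t$ are presumed to be governed by a pth-order Gaussian vector autoregression,

$$
\mathbf{y}_t = \mathbf{c} + \boldsymbol{\Phi}_1\mathbf{y}_{t-1} + \boldsymbol{\Phi}_2\mathbf{y}_{t-2} + \cdots + \boldsymbol{\Phi}_p\mathbf{y}_{t-p} + \boldsymbol{\varepsilon}_t, \tag*{[11.1.1]}
$$

with $\boldsymbol{\varepsilon}_t \sim \text{i.i.d. } N(\mathbf{0}, \boldsymbol{\Omega})$.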

Suppose we have observed each of these n variables for (T + p) time periods. As in the scalar autoregression, the simplest approach is to condition on the first p observations (denoted $\mathbf{y}_{-p+1}, \mathbf{y}_{-p+2}, \ldots, \mathbf{y}_0$) and to base estimation on the last T observations (denoted $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_T$). The objective then is to form the conditional likelihood

$$
f_{\mathbf{Y}_T, \mathbf{Y}_{T-1}, \ldots, \mathbf{Y}_1 \mid \mathbf{Y}_0, \mathbf{Y}_{-1}, \ldots, \mathbf{Y}_{-p+1}}(\mathbf{y}_T, \mathbf{y}_{T-1}, \ldots, \mathbf{y}_1 \mid \mathbf{y}_0, \mathbf{y}_{-1}, \ldots, \mathbf{y}_{-p+1}; \boldsymbol{\theta}) \tag*{[11.1.2]}
$$

and maximize with respect to $\boldsymbol{\theta}$, where $\boldsymbol{\theta}$ is a vector that contains the elements of $\mathbf{c}, \boldsymbol{\Phi}_1, \boldsymbol{\Phi}_2, \ldots, \boldsymbol{\Phi}_p$, and $\boldsymbol{\Omega}$. Vector autoregressions are invariably estimated on the basis of the conditional likelihood function [11.1.2] rather than the full-sample unconditional likelihood. For brevity, we will hereafter refer to [11.1.2] simply as the “likelihood function” and the value of $\boldsymbol{\theta}$ that maximizes [11.1.2] as the “maximum likelihood estimate.”

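The likelihood function is calculated in the same way as for a scalar autoregression. Conditional on the values of $\mathbf{y}$ observed through date t − 1, the value of $\mathbf{y}$ for date t is equal to a constant,

$$
\mathbf{c} + \boldsymbol{\Phi}_1\mathbf{y}_{t-1} + \boldsymbol{\Phi}_2\mathbf{y}_{t-2} + \cdots + \boldsymbol{\Phi}_p\mathbf{y}_{t-p}, \tag*{[11.1.3]}
$$

plus a $N(\mathbf{0}, \boldsymbol{\Omega})$ variable. Thus,

$$
\mathbf{y}_t \mid \mathbf{y}_{t-1}, \mathbf{y}_{t-2}, \ldots, \mathbf{y}_{-p+1}
\sim N\bigl((\mathbf{c} + \boldsymbol{\Phi}_1\mathbf{y}_{t-1} + \boldsymbol{\Phi}_2\mathbf{y}_{t-2} + \cdots + \boldsymbol{\Phi}_p\mathbf{y}_{t-p}),\; \boldsymbol{\Omega}\bigr). \tag*{[11.1.4]}
$$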


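It will be convenient to use a more compact expression for the conditional mean [11.1.3]. Let $\mathbf{x}_t$ denote a vector containing a constant term and p lags of each of the elements of $\mathbf{y}$:

$$
\mathbf{x}_t \equiv \begin{bmatrix} 1 \\ \mathbf{y}_{t-1} \\ \mathbf{y}_{t-2} \\ \vdots \\ \mathbf{y}_{t-p} \end{bmatrix}. \tag*{[11.1.5]}
$$

Thus, $\mathbf{x}_t$ is an [(np + 1) × 1] vector. Let $\boldsymbol{\Pi}'$ denote the following [n × (np + 1)] matrix:

$$
\boldsymbol{\Pi}' \equiv [\mathbf{c} \;\; \boldsymbol{\Phi}_1 \;\; \boldsymbol{\Phi}_2 \;\; \cdots \;\; \boldsymbol{\Phi}_p]. \tag*{[11.1.6]}
$$

Then the conditional mean [11.1.3] is equal to $\boldsymbol{\Pi}'\mathbf{x}_t$. The jth row of $\boldsymbol{\Pi}'$ contains the parameters of the jth equation in the VAR. Using this notation, [11.1.4] can be written more compactly as

$$
\mathbf{y}_t \mid \mathbf{y}_{t-1}, \mathbf{y}_{t-2}, \ldots, \mathbf{y}_{-p+1} \sim N(\boldsymbol{\Pi}'\mathbf{x}_t, \boldsymbol{\Omega}). \tag*{[11.1.7]}
$$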


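Thus, the conditional density of the tth observation is

$$
\begin{aligned}
& f_{\mathbf{Y}_t \mid \mathbf{Y}_{t-1}, \mathbf{Y}_{t-2}, \ldots, \mathbf{Y}_{-p+1}}(\mathbf{y}_t \mid \mathbf{y}_{t-1}, \mathbf{y}_{t-2}, \ldots, \mathbf{y}_{-p+1}; \boldsymbol{\theta}) \\
&\quad = (2\pi)^{-n/2}\,|\boldsymbol{\Omega}^{-1}|^{1/2} \exp\bigl[(-1/2)(\mathbf{y}_t - \boldsymbol{\Pi}'\mathbf{x}_t)'\boldsymbol{\Omega}^{-1}(\mathbf{y}_t - \boldsymbol{\Pi}'\mathbf{x}_t)\bigr].
\end{aligned} \tag*{[11.1.8]}
$$

The joint density of observations 1 through t conditioned on $\mathbf{y}_0, \mathbf{y}_{-1}, \ldots, \mathbf{y}_{-p+1}$ satisfies

$$
\begin{aligned}
& f_{\mathbf{Y}_t, \mathbf{Y}_{t-1}, \ldots, \mathbf{Y}_1 \mid \mathbf{Y}_0, \mathbf{Y}_{-1}, \ldots, \mathbf{Y}_{-p+1}}(\mathbf{y}_t, \mathbf{y}_{t-1}, \ldots, \mathbf{y}_1 \mid \mathbf{y}_0, \mathbf{y}_{-1}, \ldots, \mathbf{y}_{-p+1}; \boldsymbol{\theta}) \\
&\quad = f_{\mathbf{Y}_{t-1}, \ldots, \mathbf{Y}_1 \mid \mathbf{Y}_0, \mathbf{Y}_{-1}, \ldots, \mathbf{Y}_{-p+1}}(\mathbf{y}_{t-1}, \ldots, \mathbf{y}_1 \mid \mathbf{y}_0, \mathbf{y}_{-1}, \ldots, \mathbf{y}_{-p+1}; \boldsymbol{\theta}) \\
&\qquad \times f_{\mathbf{Y}_t \mid \mathbf{Y}_{t-1}, \mathbf{Y}_{t-2}, \ldots, \mathbf{Y}_{-p+1}}(\mathbf{y}_t \mid \mathbf{y}_{t-1}, \mathbf{y}_{t-2}, \ldots, \mathbf{y}_{-p+1}; \boldsymbol{\theta}).
\end{aligned}
$$

Applying this formula recursively, the likelihood for the full sample $\mathbf{y}_T, \mathbf{y}_{T-1}, \ldots, \mathbf{y}_1$ conditioned on $\mathbf{y}_0, \mathbf{y}_{-1}, \ldots, \mathbf{y}_{-p+1}$ is the product of the individual conditional densities:

$$
\begin{aligned}
& f_{\mathbf{Y}_T, \mathbf{Y}_{T-1}, \ldots, \mathbf{Y}_1 \mid \mathbf{Y}_0, \mathbf{Y}_{-1}, \ldots, \mathbf{Y}_{-p+1}}(\mathbf{y}_T, \mathbf{y}_{T-1}, \ldots, \mathbf{y}_1 \mid \mathbf{y}_0, \mathbf{y}_{-1}, \ldots, \mathbf{y}_{-p+1}; \boldsymbol{\theta}) \\
&\quad = \prod_{t=1}^{T} f_{\mathbf{Y}_t \mid \mathbf{Y}_{t-1}, \mathbf{Y}_{t-2}, \ldots, \mathbf{Y}_{-p+1}}(\mathbf{y}_t \mid \mathbf{y}_{t-1}, \mathbf{y}_{t-2}, \ldots, \mathbf{y}_{-p+1}; \boldsymbol{\theta}).
\end{aligned} \tag*{[11.1.9]}
$$

The sample log likelihood is found by substituting [11.1.8] into [11.1.9] and taking logs:

$$
\begin{aligned}
\mathcal{L}(\boldsymbol{\theta}) &= \sum_{t=1}^{T} \log f_{\mathbf{Y}_t \mid \mathbf{Y}_{t-1}, \mathbf{Y}_{t-2}, \ldots, \mathbf{Y}_{-p+1}}(\mathbf{y}_t \mid \mathbf{y}_{t-1}, \mathbf{y}_{t-2}, \ldots, \mathbf{y}_{-p+1}; \boldsymbol{\theta}) \\
&= -(Tn/2)\log(2\pi) + (T/2)\log|\boldsymbol{\Omega}^{-1}| \\
&\quad - (1/2)\sum_{t=1}^{T}\bigl[(\mathbf{y}_t - \boldsymbol{\Pi}'\mathbf{x}_t)'\boldsymbol{\Omega}^{-1}(\mathbf{y}_t - \boldsymbol{\Pi}'\mathbf{x}_t)\bigr].
\end{aligned} \tag*{[11.1.10]}
$$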


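Maximum Likelihood Estimate of Π


Consider first the MLE of $\boldsymbol{\Pi}$, which contains the constant term $\mathbf{c}$ and autoregressive coefficients $\boldsymbol{\Phi}_j$. This turns out to be given by

$$
\underset{[n \times (np+1)]}{\hat{\boldsymbol{\Pi}}'} = \left[\sum_{t=1}^{T}\mathbf{y}_t\mathbf{x}_t'\right]\left[\sum_{t=1}^{T}\mathbf{x}_t\mathbf{x}_t'\right]^{-1}, \tag*{[11.1.11]}
$$

which can be viewed as the sample analog of the population linear projection of $\mathbf{y}_t$ on a constant and $\mathbf{x}_t$ (equation [4.1.23]). The jth row of $\hat{\boldsymbol{\Pi}}'$ is

$$
\underset{[1 \times (np+1)]}{\hat{\boldsymbol{\pi}}_j'} = \left[\sum_{t=1}^{T}y_{jt}\mathbf{x}_t'\right]\left[\sum_{t=1}^{T}\mathbf{x}_t\mathbf{x}_t'\right]^{-1}, \tag*{[11.1.12]}
$$

which is just the estimated coefficient vector from an OLS regression of $y_{jt}$ on $\mathbf{x}_t$. Thus, maximum likelihood estimates of the coefficients for the jth equation of a VAR are found by an OLS regression of $y_{jt}$ on a constant term and p lags of all of the variables in the system.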
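The equation-by-equation OLS calculation in [11.1.11] and [11.1.12] can be sketched as follows. This is an illustrative Python/NumPy sketch rather than code from the original text; the function names `build_regressors` and `fit_var_ols` and the simulated VAR(1) parameters are assumptions made only for the example.

```python
import numpy as np

def build_regressors(y, p):
    """Return X with rows x_t' = (1, y_{t-1}', ..., y_{t-p}') and the matching Y.

    y holds all T + p observations, one row per date; the first p rows are
    used only as initial conditions.
    """
    total, n = y.shape
    T = total - p
    X = np.ones((T, 1 + n * p))
    for lag in range(1, p + 1):
        X[:, 1 + (lag - 1) * n : 1 + lag * n] = y[p - lag : total - lag]
    return X, y[p:]

def fit_var_ols(y, p):
    """Compute Pi_hat' = [sum y_t x_t'][sum x_t x_t']^{-1} and the residuals."""
    X, Y = build_regressors(y, p)
    Pi_hat_T = np.linalg.solve(X.T @ X, X.T @ Y).T   # n x (np + 1)
    resid = Y - X @ Pi_hat_T.T                       # row t is eps_hat_t'
    return Pi_hat_T, resid

# Hypothetical bivariate VAR(1), simulated only to exercise the functions.
rng = np.random.default_rng(0)
c = np.array([0.5, -0.2])
Phi1 = np.array([[0.5, 0.1], [0.2, 0.3]])
y = np.zeros((301, 2))
for t in range(1, 301):
    y[t] = c + Phi1 @ y[t - 1] + rng.normal(size=2)

Pi_hat_T, resid = fit_var_ols(y, p=1)
X, Y = build_regressors(y, 1)
# Row j of Pi_hat' should match a separate OLS regression of y_jt on x_t.
row0_separate, *_ = np.linalg.lstsq(X, Y[:, 0], rcond=None)
```

Each row of `Pi_hat_T` collects the constant and the lag coefficients of one equation, so the joint formula and n separate regressions give the same numbers.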


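To verify [11.1.11], write the sum appearing in the last term in [11.1.10] as

$$
\begin{aligned}
& \sum_{t=1}^{T}\bigl[(\mathbf{y}_t - \boldsymbol{\Pi}'\mathbf{x}_t)'\boldsymbol{\Omega}^{-1}(\mathbf{y}_t - \boldsymbol{\Pi}'\mathbf{x}_t)\bigr] \\
&\quad = \sum_{t=1}^{T}\bigl[(\mathbf{y}_t - \hat{\boldsymbol{\Pi}}'\mathbf{x}_t + \hat{\boldsymbol{\Pi}}'\mathbf{x}_t - \boldsymbol{\Pi}'\mathbf{x}_t)'\boldsymbol{\Omega}^{-1}(\mathbf{y}_t - \hat{\boldsymbol{\Pi}}'\mathbf{x}_t + \hat{\boldsymbol{\Pi}}'\mathbf{x}_t - \boldsymbol{\Pi}'\mathbf{x}_t)\bigr] \\
&\quad = \sum_{t=1}^{T}\bigl[\hat{\boldsymbol{\varepsilon}}_t + (\hat{\boldsymbol{\Pi}} - \boldsymbol{\Pi})'\mathbf{x}_t\bigr]'\boldsymbol{\Omega}^{-1}\bigl[\hat{\boldsymbol{\varepsilon}}_t + (\hat{\boldsymbol{\Pi}} - \boldsymbol{\Pi})'\mathbf{x}_t\bigr],
\end{aligned} \tag*{[11.1.13]}
$$

where the jth element of the (n × 1) vector $\hat{\boldsymbol{\varepsilon}}_t$ is the sample residual for observation t from an OLS regression of $y_{jt}$ on $\mathbf{x}_t$:

$$
\hat{\boldsymbol{\varepsilon}}_t \equiv \mathbf{y}_t - \hat{\boldsymbol{\Pi}}'\mathbf{x}_t. \tag*{[11.1.14]}
$$

Expression [11.1.13] can be expanded as

$$
\begin{aligned}
& \sum_{t=1}^{T}\bigl[(\mathbf{y}_t - \boldsymbol{\Pi}'\mathbf{x}_t)'\boldsymbol{\Omega}^{-1}(\mathbf{y}_t - \boldsymbol{\Pi}'\mathbf{x}_t)\bigr] \\
&\quad = \sum_{t=1}^{T}\hat{\boldsymbol{\varepsilon}}_t'\boldsymbol{\Omega}^{-1}\hat{\boldsymbol{\varepsilon}}_t
+ 2\sum_{t=1}^{T}\hat{\boldsymbol{\varepsilon}}_t'\boldsymbol{\Omega}^{-1}(\hat{\boldsymbol{\Pi}} - \boldsymbol{\Pi})'\mathbf{x}_t \\
&\qquad + \sum_{t=1}^{T}\mathbf{x}_t'(\hat{\boldsymbol{\Pi}} - \boldsymbol{\Pi})\boldsymbol{\Omega}^{-1}(\hat{\boldsymbol{\Pi}} - \boldsymbol{\Pi})'\mathbf{x}_t.
\end{aligned} \tag*{[11.1.15]}
$$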
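Consider the middle term in [11.1.15]. Since this is a scalar, it is unchanged by applying the “trace” operator:

$$
\begin{aligned}
\sum_{t=1}^{T}\hat{\boldsymbol{\varepsilon}}_t'\boldsymbol{\Omega}^{-1}(\hat{\boldsymbol{\Pi}} - \boldsymbol{\Pi})'\mathbf{x}_t
&= \operatorname{trace}\left[\sum_{t=1}^{T}\hat{\boldsymbol{\varepsilon}}_t'\boldsymbol{\Omega}^{-1}(\hat{\boldsymbol{\Pi}} - \boldsymbol{\Pi})'\mathbf{x}_t\right] \\
&= \operatorname{trace}\left[\sum_{t=1}^{T}\boldsymbol{\Omega}^{-1}(\hat{\boldsymbol{\Pi}} - \boldsymbol{\Pi})'\mathbf{x}_t\hat{\boldsymbol{\varepsilon}}_t'\right] \\
&= \operatorname{trace}\left[\boldsymbol{\Omega}^{-1}(\hat{\boldsymbol{\Pi}} - \boldsymbol{\Pi})'\sum_{t=1}^{T}\mathbf{x}_t\hat{\boldsymbol{\varepsilon}}_t'\right].
\end{aligned} \tag*{[11.1.16]}
$$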


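But the sample residuals from an OLS regression are by construction orthogonal to the explanatory variables, meaning that $\sum_{t=1}^{T}\mathbf{x}_t\hat{\varepsilon}_{jt} = \mathbf{0}$ for all j and so $\sum_{t=1}^{T}\mathbf{x}_t\hat{\boldsymbol{\varepsilon}}_t' = \mathbf{0}$. Hence, [11.1.16] is identically zero, and [11.1.15] simplifies to

$$
\begin{aligned}
& \sum_{t=1}^{T}\bigl[(\mathbf{y}_t - \boldsymbol{\Pi}'\mathbf{x}_t)'\boldsymbol{\Omega}^{-1}(\mathbf{y}_t - \boldsymbol{\Pi}'\mathbf{x}_t)\bigr] \\
&\quad = \sum_{t=1}^{T}\hat{\boldsymbol{\varepsilon}}_t'\boldsymbol{\Omega}^{-1}\hat{\boldsymbol{\varepsilon}}_t
+ \sum_{t=1}^{T}\mathbf{x}_t'(\hat{\boldsymbol{\Pi}} - \boldsymbol{\Pi})\boldsymbol{\Omega}^{-1}(\hat{\boldsymbol{\Pi}} - \boldsymbol{\Pi})'\mathbf{x}_t.
\end{aligned} \tag*{[11.1.17]}
$$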


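Since $\boldsymbol{\Omega}$ is a positive definite matrix, $\boldsymbol{\Omega}^{-1}$ is as well.¹ Thus, defining the (n × 1) vector $\mathbf{x}_t^*$ as

$$
\mathbf{x}_t^* \equiv (\hat{\boldsymbol{\Pi}} - \boldsymbol{\Pi})'\mathbf{x}_t,
$$

the last term in [11.1.17] takes the form

$$
\sum_{t=1}^{T}\mathbf{x}_t'(\hat{\boldsymbol{\Pi}} - \boldsymbol{\Pi})\boldsymbol{\Omega}^{-1}(\hat{\boldsymbol{\Pi}} - \boldsymbol{\Pi})'\mathbf{x}_t = \sum_{t=1}^{T}[\mathbf{x}_t^*]'\boldsymbol{\Omega}^{-1}\mathbf{x}_t^*.
$$

This is positive for any sequence $\{\mathbf{x}_t^*\}_{t=1}^{T}$ other than $\mathbf{x}_t^* = \mathbf{0}$ for all t. Thus, the smallest value that [11.1.17] can take on is achieved when $\mathbf{x}_t^* = \mathbf{0}$, or when $\boldsymbol{\Pi} = \hat{\boldsymbol{\Pi}}$. Since [11.1.17] is minimized by setting $\boldsymbol{\Pi} = \hat{\boldsymbol{\Pi}}$, it follows that [11.1.10] is maximized by setting $\boldsymbol{\Pi} = \hat{\boldsymbol{\Pi}}$, establishing the claim that OLS regressions provide the maximum likelihood estimates of the coefficients of a vector autoregression.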


Some Useful Results on Matrix Derivatives 


The next task is to calculate the maximum likelihood estimate of $\boldsymbol{\Omega}$. Here two results from matrix calculus will prove helpful. The first result concerns the derivative of a quadratic form in a matrix. Let $a_{ij}$ denote the row i, column j element of an (n × n) matrix $\mathbf{A}$. Suppose that the matrix $\mathbf{A}$ is nonsymmetric and unrestricted (that is, the value of $a_{ij}$ is unrelated to the value of $a_{kl}$ when either i ≠ k or j ≠ l). Consider a quadratic form $\mathbf{x}'\mathbf{A}\mathbf{x}$ for $\mathbf{x}$ an (n × 1) vector. The quadratic form can be written out explicitly as

$$
\mathbf{x}'\mathbf{A}\mathbf{x} = \sum_{i=1}^{n}\sum_{j=1}^{n} x_i a_{ij} x_j, \tag*{[11.1.18]}
$$

from which

$$
\frac{\partial\,\mathbf{x}'\mathbf{A}\mathbf{x}}{\partial a_{ij}} = x_i x_j. \tag*{[11.1.19]}
$$


¹This follows immediately from the fact that $\boldsymbol{\Omega}^{-1}$ can be written as $\mathbf{L}'\mathbf{L}$ for $\mathbf{L}$ a nonsingular matrix as in [8.3.1].




Collecting these n² different derivatives into an (n × n) matrix, equation [11.1.19] can conveniently be expressed in matrix form as

$$
\frac{\partial\,\mathbf{x}'\mathbf{A}\mathbf{x}}{\partial\mathbf{A}} = \mathbf{x}\mathbf{x}'. \tag*{[11.1.20]}
$$

The second result concerns the derivative of the determinant of a matrix. Let $\mathbf{A}$ be a nonsymmetric unrestricted (n × n) matrix with positive determinant. Then

$$
\frac{\partial\log|\mathbf{A}|}{\partial a_{ij}} = a^{ji}, \tag*{[11.1.21]}
$$

where $a^{ji}$ denotes the row j, column i element of $\mathbf{A}^{-1}$. In matrix form,

$$
\frac{\partial\log|\mathbf{A}|}{\partial\mathbf{A}} = (\mathbf{A}')^{-1}. \tag*{[11.1.22]}
$$


To derive [11.1.22], recall the formula for the determinant of $\mathbf{A}$ (equation [A.4.10] in the Mathematical Review, Appendix A, at the end of the book):

$$
|\mathbf{A}| = \sum_{j=1}^{n}(-1)^{i+j}a_{ij}|\mathbf{A}_{ij}|, \tag*{[11.1.23]}
$$

where $\mathbf{A}_{ij}$ denotes the (n − 1) × (n − 1) matrix formed by deleting row i and column j from $\mathbf{A}$. The derivative of [11.1.23] with respect to $a_{ij}$ is

$$
\frac{\partial|\mathbf{A}|}{\partial a_{ij}} = (-1)^{i+j}|\mathbf{A}_{ij}|, \tag*{[11.1.24]}
$$

since the parameter $a_{ij}$ does not appear in the matrix $\mathbf{A}_{ij}$. It follows that

$$
\frac{\partial\log|\mathbf{A}|}{\partial a_{ij}} = (1/|\mathbf{A}|)\cdot(-1)^{i+j}|\mathbf{A}_{ij}|,
$$

which will be recognized from equation [A.4.12] as the row j, column i element of $\mathbf{A}^{-1}$, as claimed in equation [11.1.21].
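Both derivative results, [11.1.20] and [11.1.22], can be checked against central finite differences. The sketch below is an illustration in Python/NumPy and is not from the original text; `numerical_gradient` is a hypothetical helper, and the test matrix is built to be nonsymmetric with a positive determinant (its diagonal dominates the off-diagonal entries).

```python
import numpy as np

def numerical_gradient(f, A, h=1e-6):
    """Central-difference approximation to the matrix of derivatives df/da_ij."""
    G = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            E = np.zeros_like(A)
            E[i, j] = h
            G[i, j] = (f(A + E) - f(A - E)) / (2 * h)
    return G

rng = np.random.default_rng(1)
n = 3
x = rng.normal(size=n)
# Diagonally dominant, nonsymmetric matrix: all eigenvalues have positive
# real part, so the determinant is positive.
A = 4.0 * np.eye(n) + rng.uniform(-0.5, 0.5, size=(n, n))

quad_grad = numerical_gradient(lambda M: x @ M @ x, A)
logdet_grad = numerical_gradient(lambda M: np.log(np.linalg.det(M)), A)

quad_matches = np.allclose(quad_grad, np.outer(x, x), atol=1e-5)          # [11.1.20]
logdet_matches = np.allclose(logdet_grad, np.linalg.inv(A).T, atol=1e-5)  # [11.1.22]
```

The transpose in `np.linalg.inv(A).T` reflects the index reversal in [11.1.21]: the derivative with respect to $a_{ij}$ is the (j, i) element of the inverse.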


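The Maximum Likelihood Estimate of Ω


We now apply these results to find the MLE of $\boldsymbol{\Omega}$. When evaluated at the MLE $\hat{\boldsymbol{\Pi}}$, the log likelihood [11.1.10] is

$$
\mathcal{L}(\boldsymbol{\Omega}, \hat{\boldsymbol{\Pi}}) = -(Tn/2)\log(2\pi) + (T/2)\log|\boldsymbol{\Omega}^{-1}| - (1/2)\sum_{t=1}^{T}\hat{\boldsymbol{\varepsilon}}_t'\boldsymbol{\Omega}^{-1}\hat{\boldsymbol{\varepsilon}}_t. \tag*{[11.1.25]}
$$

Our objective is to find a symmetric positive definite matrix $\boldsymbol{\Omega}$ for which this is as large as possible. It is instructive to consider first maximizing [11.1.25] by choosing $\boldsymbol{\Omega}$ to be any unrestricted (n × n) matrix. For that purpose we can just differentiate [11.1.25] with respect to the elements of $\boldsymbol{\Omega}^{-1}$ using formulas [11.1.20] and [11.1.22]:

$$
\begin{aligned}
\frac{\partial\mathcal{L}(\boldsymbol{\Omega}, \hat{\boldsymbol{\Pi}})}{\partial\boldsymbol{\Omega}^{-1}}
&= (T/2)\frac{\partial\log|\boldsymbol{\Omega}^{-1}|}{\partial\boldsymbol{\Omega}^{-1}} - (1/2)\sum_{t=1}^{T}\frac{\partial\,\hat{\boldsymbol{\varepsilon}}_t'\boldsymbol{\Omega}^{-1}\hat{\boldsymbol{\varepsilon}}_t}{\partial\boldsymbol{\Omega}^{-1}} \\
&= (T/2)\boldsymbol{\Omega}' - (1/2)\sum_{t=1}^{T}\hat{\boldsymbol{\varepsilon}}_t\hat{\boldsymbol{\varepsilon}}_t'.
\end{aligned} \tag*{[11.1.26]}
$$

The likelihood is maximized when this derivative is set to zero, or when

$$
\boldsymbol{\Omega}' = (1/T)\sum_{t=1}^{T}\hat{\boldsymbol{\varepsilon}}_t\hat{\boldsymbol{\varepsilon}}_t'. \tag*{[11.1.27]}
$$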
f= 




The matrix $\boldsymbol{\Omega}$ that satisfies [11.1.27] maximizes the likelihood among the class of all unrestricted (n × n) matrices. Note, however, that the optimal unrestricted value for $\boldsymbol{\Omega}$ that is specified by [11.1.27] turns out to be symmetric and positive definite. The MLE, or the value of $\boldsymbol{\Omega}$ that maximizes the likelihood among the class of all symmetric positive definite matrices, is thus also given by [11.1.27]:

$$
\hat{\boldsymbol{\Omega}} = (1/T)\sum_{t=1}^{T}\hat{\boldsymbol{\varepsilon}}_t\hat{\boldsymbol{\varepsilon}}_t'. \tag*{[11.1.28]}
$$

The row i, column i element of $\hat{\boldsymbol{\Omega}}$ is given by

$$
\hat{\sigma}_i^2 = (1/T)\sum_{t=1}^{T}\hat{\varepsilon}_{it}^2, \tag*{[11.1.29]}
$$

which is just the average squared residual from a regression of the ith variable in the VAR on a constant term and p lags of all the variables. The row i, column j element of $\hat{\boldsymbol{\Omega}}$ is

$$
\hat{\sigma}_{ij} = (1/T)\sum_{t=1}^{T}\hat{\varepsilon}_{it}\hat{\varepsilon}_{jt}, \tag*{[11.1.30]}
$$

which is the average product of the OLS residual for variable i and the OLS residual for variable j.


Likelihood Ratio Tests 


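To perform a likelihood ratio test, we need to calculate the maximum value achieved for [11.1.25]. Thus, consider

$$
\mathcal{L}(\hat{\boldsymbol{\Omega}}, \hat{\boldsymbol{\Pi}}) = -(Tn/2)\log(2\pi) + (T/2)\log|\hat{\boldsymbol{\Omega}}^{-1}| - (1/2)\sum_{t=1}^{T}\hat{\boldsymbol{\varepsilon}}_t'\hat{\boldsymbol{\Omega}}^{-1}\hat{\boldsymbol{\varepsilon}}_t \tag*{[11.1.31]}
$$

for $\hat{\boldsymbol{\Omega}}$ given by [11.1.28]. The last term in [11.1.31] is

$$
\begin{aligned}
(1/2)\sum_{t=1}^{T}\hat{\boldsymbol{\varepsilon}}_t'\hat{\boldsymbol{\Omega}}^{-1}\hat{\boldsymbol{\varepsilon}}_t
&= (1/2)\operatorname{trace}\left[\sum_{t=1}^{T}\hat{\boldsymbol{\varepsilon}}_t'\hat{\boldsymbol{\Omega}}^{-1}\hat{\boldsymbol{\varepsilon}}_t\right] \\
&= (1/2)\operatorname{trace}\left[\sum_{t=1}^{T}\hat{\boldsymbol{\Omega}}^{-1}\hat{\boldsymbol{\varepsilon}}_t\hat{\boldsymbol{\varepsilon}}_t'\right] \\
&= (1/2)\operatorname{trace}[\hat{\boldsymbol{\Omega}}^{-1}(T\hat{\boldsymbol{\Omega}})] \\
&= (1/2)\operatorname{trace}(T\cdot\mathbf{I}_n) \\
&= Tn/2.
\end{aligned}
$$

Substituting this into [11.1.31] produces

$$
\mathcal{L}(\hat{\boldsymbol{\Omega}}, \hat{\boldsymbol{\Pi}}) = -(Tn/2)\log(2\pi) + (T/2)\log|\hat{\boldsymbol{\Omega}}^{-1}| - (Tn/2). \tag*{[11.1.32]}
$$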
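As a numerical illustration of [11.1.28] and [11.1.32], the sketch below (Python/NumPy, not from the original text) evaluates the Gaussian log likelihood at $\hat{\boldsymbol{\Omega}}$, compares it with the closed form, and confirms that two other positive definite candidates give lower values. The residual matrix is random stand-in data, and `gaussian_loglik` is a hypothetical helper name.

```python
import numpy as np

def gaussian_loglik(Omega, resid):
    """Log likelihood [11.1.25] for a (T x n) array of residuals eps_hat_t'."""
    T, n = resid.shape
    Omega_inv = np.linalg.inv(Omega)
    _, logdet_inv = np.linalg.slogdet(Omega_inv)
    quad = np.einsum("ti,ij,tj->", resid, Omega_inv, resid)
    return -(T * n / 2) * np.log(2 * np.pi) + (T / 2) * logdet_inv - 0.5 * quad

rng = np.random.default_rng(2)
T, n = 200, 3
mix = np.array([[1.0, 0.0, 0.0], [0.5, 1.0, 0.0], [0.2, -0.3, 1.0]])
resid = rng.normal(size=(T, n)) @ mix        # stand-in for OLS residuals

Omega_hat = resid.T @ resid / T              # [11.1.28]
max_value = gaussian_loglik(Omega_hat, resid)
closed_form = (-(T * n / 2) * np.log(2 * np.pi)
               + (T / 2) * np.log(1.0 / np.linalg.det(Omega_hat))
               - T * n / 2)                  # [11.1.32]

# Other positive definite candidates should give a strictly lower value.
alternatives = [gaussian_loglik(Omega_hat + 0.1 * np.eye(n), resid),
                gaussian_loglik(0.8 * Omega_hat, resid)]
```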


This makes likelihood ratio tests particularly simple to perform. Suppose we want to test the null hypothesis that a set of variables was generated from a Gaussian VAR with $p_0$ lags against the alternative specification of $p_1 > p_0$ lags. To estimate the system under the null hypothesis, we perform a set of n OLS regressions of each variable in the system on a constant term and on $p_0$ lags of all the variables in the system. Let $\hat{\boldsymbol{\Omega}}_0 = (1/T)\sum_{t=1}^{T}\hat{\boldsymbol{\varepsilon}}_t(p_0)[\hat{\boldsymbol{\varepsilon}}_t(p_0)]'$ be the variance-covariance matrix of the residuals from these regressions. The maximum value for the log likelihood under $H_0$ is then

$$
\mathcal{L}_0^* = -(Tn/2)\log(2\pi) + (T/2)\log|\hat{\boldsymbol{\Omega}}_0^{-1}| - (Tn/2).
$$

Similarly, the system is estimated under the alternative hypothesis by OLS regressions that include $p_1$ lags of all the variables. The maximized log likelihood under the alternative is

$$
\mathcal{L}_1^* = -(Tn/2)\log(2\pi) + (T/2)\log|\hat{\boldsymbol{\Omega}}_1^{-1}| - (Tn/2),
$$

where $\hat{\boldsymbol{\Omega}}_1$ is the variance-covariance matrix of the residuals from this second set of regressions. Twice the log likelihood ratio is then

$$
\begin{aligned}
2(\mathcal{L}_1^* - \mathcal{L}_0^*) &= 2\{(T/2)\log|\hat{\boldsymbol{\Omega}}_1^{-1}| - (T/2)\log|\hat{\boldsymbol{\Omega}}_0^{-1}|\} \\
&= T\log(1/|\hat{\boldsymbol{\Omega}}_1|) - T\log(1/|\hat{\boldsymbol{\Omega}}_0|) \\
&= -T\log|\hat{\boldsymbol{\Omega}}_1| + T\log|\hat{\boldsymbol{\Omega}}_0| \\
&= T\{\log|\hat{\boldsymbol{\Omega}}_0| - \log|\hat{\boldsymbol{\Omega}}_1|\}.
\end{aligned} \tag*{[11.1.33]}
$$

Under the null hypothesis, this asymptotically has a χ² distribution with degrees of freedom equal to the number of restrictions imposed under $H_0$. Each equation in the specification restricted by $H_0$ has $(p_1 - p_0)$ fewer lags on each of n variables compared with $H_1$; thus, $H_0$ imposes $n(p_1 - p_0)$ restrictions on each equation. Since there are n such equations, $H_0$ imposes $n^2(p_1 - p_0)$ restrictions. Thus, the magnitude calculated in [11.1.33] is asymptotically χ² with $n^2(p_1 - p_0)$ degrees of freedom.

For example, suppose a bivariate VAR is estimated with three and four lags ($n = 2$, $p_0 = 3$, $p_1 = 4$). Say that the original sample contains 50 observations on each variable (denoted $\mathbf{y}_{-3}, \mathbf{y}_{-2}, \ldots, \mathbf{y}_{46}$) and that observations 1 through 46 were used to estimate both the three- and four-lag specifications so that T = 46. Let $\hat{\varepsilon}_{1t}(p_0)$ be the sample residual for observation t from an OLS regression of $y_{1t}$ on a constant, three lags of $y_{1t}$, and three lags of $y_{2t}$. Suppose that $(1/T)\sum_{t=1}^{T}[\hat{\varepsilon}_{1t}(p_0)]^2 = 2.0$, $(1/T)\sum_{t=1}^{T}[\hat{\varepsilon}_{2t}(p_0)]^2 = 2.5$, and $(1/T)\sum_{t=1}^{T}\hat{\varepsilon}_{1t}(p_0)\hat{\varepsilon}_{2t}(p_0) = 1.0$. Then

$$
\hat{\boldsymbol{\Omega}}_0 = \begin{bmatrix} 2.0 & 1.0 \\ 1.0 & 2.5 \end{bmatrix}
$$

and $\log|\hat{\boldsymbol{\Omega}}_0| = \log 4 = 1.386$. Suppose that when a fourth lag is added to each regression, the residual covariance matrix is reduced to

$$
\hat{\boldsymbol{\Omega}}_1 = \begin{bmatrix} 1.8 & 0.9 \\ 0.9 & 2.2 \end{bmatrix},
$$

for which $\log|\hat{\boldsymbol{\Omega}}_1| = 1.147$. Then

$$
2(\mathcal{L}_1^* - \mathcal{L}_0^*) = 46(1.386 - 1.147) = 10.99.
$$

The degrees of freedom for this test are $2^2(4 - 3) = 4$. Since 10.99 > 9.49 (the 5% critical value for a χ²(4) variable), the null hypothesis is rejected. The dynamics are not completely captured by a three-lag VAR, and a four-lag specification seems preferable.


Sims (1980, p. 17) suggested a modification to the likelihood ratio test to take into account small-sample bias. He recommended replacing [11.1.33] by

$$
(T - k)\{\log|\hat{\boldsymbol{\Omega}}_0| - \log|\hat{\boldsymbol{\Omega}}_1|\}, \tag*{[11.1.34]}
$$

where $k = 1 + np_1$ is the number of parameters estimated per equation. The adjusted test has the same asymptotic distribution as [11.1.33] but is less likely to reject the null hypothesis in small samples. For the present example, this test statistic would be

$$
(46 - 9)(1.386 - 1.147) = 8.84,
$$

and the earlier conclusion would be reversed ($H_0$ would be accepted).
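The two versions of the lag-length test in this example can be reproduced with a few lines of Python/NumPy. This is a sketch added for illustration; `lr_statistic` is a hypothetical helper name, and the 5% critical value 9.49 is simply the number quoted in the text rather than something computed here.

```python
import numpy as np

def lr_statistic(omega0, omega1, T, k=0):
    """(T - k) * {log|Omega_0| - log|Omega_1|}.

    k = 0 gives the likelihood ratio statistic [11.1.33];
    k = 1 + n * p1 gives the small-sample modification [11.1.34].
    """
    _, logdet0 = np.linalg.slogdet(omega0)
    _, logdet1 = np.linalg.slogdet(omega1)
    return (T - k) * (logdet0 - logdet1)

omega0 = np.array([[2.0, 1.0], [1.0, 2.5]])   # residual covariance, 3 lags
omega1 = np.array([[1.8, 0.9], [0.9, 2.2]])   # residual covariance, 4 lags
n, p0, p1, T = 2, 3, 4, 46

lr = lr_statistic(omega0, omega1, T)                      # about 10.99
lr_sims = lr_statistic(omega0, omega1, T, k=1 + n * p1)   # about 8.84
dof = n**2 * (p1 - p0)                                    # 4 restrictions
critical_5pct = 9.49                                      # chi-square(4), from the text

reject_unadjusted = lr > critical_5pct
reject_adjusted = lr_sims > critical_5pct
```

With these inputs the unadjusted statistic rejects the three-lag specification and the adjusted statistic does not, matching the discussion above.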


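Asymptotic Distribution of Π̂


The maximum likelihood estimates $\hat{\boldsymbol{\Pi}}$ and $\hat{\boldsymbol{\Omega}}$ will give consistent estimates of the population parameters even if the true innovations are non-Gaussian. Standard errors for $\hat{\boldsymbol{\Pi}}$ can be based on the usual OLS formulas, as the following proposition demonstrates.


Proposition 11.1: Let

$$
\mathbf{y}_t = \mathbf{c} + \boldsymbol{\Phi}_1\mathbf{y}_{t-1} + \boldsymbol{\Phi}_2\mathbf{y}_{t-2} + \cdots + \boldsymbol{\Phi}_p\mathbf{y}_{t-p} + \boldsymbol{\varepsilon}_t,
$$

where $\boldsymbol{\varepsilon}_t$ is independent and identically distributed with mean $\mathbf{0}$, variance $\boldsymbol{\Omega}$, and $E(\varepsilon_{it}\varepsilon_{jt}\varepsilon_{lt}\varepsilon_{mt}) < \infty$ for all i, j, l, and m and where roots of

$$
|\mathbf{I}_n - \boldsymbol{\Phi}_1 z - \boldsymbol{\Phi}_2 z^2 - \cdots - \boldsymbol{\Phi}_p z^p| = 0 \tag*{[11.1.35]}
$$

lie outside the unit circle. Let $k = np + 1$, and let $\mathbf{x}_t'$ be the (1 × k) vector

$$
\mathbf{x}_t' \equiv [1 \;\; \mathbf{y}_{t-1}' \;\; \mathbf{y}_{t-2}' \;\; \cdots \;\; \mathbf{y}_{t-p}'].
$$

Let $\hat{\boldsymbol{\pi}}_T = \operatorname{vec}(\hat{\boldsymbol{\Pi}}_T)$ denote the (nk × 1) vector of coefficients resulting from OLS regressions of each of the elements of $\mathbf{y}_t$ on $\mathbf{x}_t$ for a sample of size T:

$$
\hat{\boldsymbol{\pi}}_T = \begin{bmatrix} \hat{\boldsymbol{\pi}}_{1,T} \\ \hat{\boldsymbol{\pi}}_{2,T} \\ \vdots \\ \hat{\boldsymbol{\pi}}_{n,T} \end{bmatrix},
$$

where

$$
\hat{\boldsymbol{\pi}}_{i,T} = \left[\sum_{t=1}^{T}\mathbf{x}_t\mathbf{x}_t'\right]^{-1}\left[\sum_{t=1}^{T}\mathbf{x}_t y_{it}\right],
$$

and let $\boldsymbol{\pi}$ denote the (nk × 1) vector of corresponding population coefficients. Finally, let

$$
\hat{\boldsymbol{\Omega}}_T = (1/T)\sum_{t=1}^{T}\hat{\boldsymbol{\varepsilon}}_t\hat{\boldsymbol{\varepsilon}}_t',
$$

where

$$
\begin{aligned}
\hat{\boldsymbol{\varepsilon}}_t' &= [\hat{\varepsilon}_{1t} \;\; \hat{\varepsilon}_{2t} \;\; \cdots \;\; \hat{\varepsilon}_{nt}] \\
\hat{\varepsilon}_{it} &= y_{it} - \mathbf{x}_t'\hat{\boldsymbol{\pi}}_{i,T}.
\end{aligned}
$$

Then

(a) $(1/T)\sum_{t=1}^{T}\mathbf{x}_t\mathbf{x}_t' \xrightarrow{p} \mathbf{Q}$ where $\mathbf{Q} = E(\mathbf{x}_t\mathbf{x}_t')$;

(b) $\hat{\boldsymbol{\pi}}_T \xrightarrow{p} \boldsymbol{\pi}$;

(c) $\hat{\boldsymbol{\Omega}}_T \xrightarrow{p} \boldsymbol{\Omega}$;

(d) $\sqrt{T}(\hat{\boldsymbol{\pi}}_T - \boldsymbol{\pi}) \xrightarrow{L} N(\mathbf{0}, (\boldsymbol{\Omega} \otimes \mathbf{Q}^{-1}))$, where ⊗ denotes the Kronecker product.


A proof of this proposition is provided in Appendix 11.A to this chapter.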
If we are interested only in $\hat{\boldsymbol{\pi}}_{i,T}$, the coefficients of the ith regression in the VAR, result (d) implies that

$$
\sqrt{T}(\hat{\boldsymbol{\pi}}_{i,T} - \boldsymbol{\pi}_i) \xrightarrow{L} N(\mathbf{0}, \sigma_i^2\mathbf{Q}^{-1}), \tag*{[11.1.36]}
$$

where $\sigma_i^2 = E(\varepsilon_{it}^2)$ is the variance of the innovation of the ith equation in the VAR. But $\sigma_i^2$ is estimated consistently by $\hat{\sigma}_i^2 = (1/T)\sum_{t=1}^{T}\hat{\varepsilon}_{it}^2$, the average squared residual from OLS estimation of this equation. Similarly, $\mathbf{Q}^{-1}$ is estimated consistently by $[(1/T)\sum_{t=1}^{T}\mathbf{x}_t\mathbf{x}_t']^{-1}$. Hence, [11.1.36] invites us to treat $\hat{\boldsymbol{\pi}}_i$ approximately as

$$
\hat{\boldsymbol{\pi}}_i \approx N\left(\boldsymbol{\pi}_i,\; \hat{\sigma}_i^2\left[\sum_{t=1}^{T}\mathbf{x}_t\mathbf{x}_t'\right]^{-1}\right). \tag*{[11.1.37]}
$$

But this is the standard OLS formula for coefficient variances with $s_i^2 = [1/(T - k)]\sum_{t=1}^{T}\hat{\varepsilon}_{it}^2$ in the standard formula replaced by the maximum likelihood estimate $\hat{\sigma}_i^2$ in [11.1.37]. Clearly, $s_i^2$ and $\hat{\sigma}_i^2$ are asymptotically equivalent, though following Sims’s argument in [11.1.34], the larger (and thus more conservative) standard errors resulting from the OLS formulas might be preferred. Hence, Proposition 11.1 establishes that the standard OLS t and F statistics applied to the coefficients of any single equation in the VAR are asymptotically valid and can be evaluated in the usual way.

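A more general hypothesis of the form $\mathbf{R}\boldsymbol{\pi} = \mathbf{r}$ involving coefficients across different equations of the VAR can be tested using a generalization of the Wald form of the OLS χ² test (expression [8.2.23]). Result (d) of Proposition 11.1 establishes that

$$
\sqrt{T}(\mathbf{R}\hat{\boldsymbol{\pi}}_T - \mathbf{r}) \xrightarrow{L} N\bigl(\mathbf{0}, \mathbf{R}(\boldsymbol{\Omega} \otimes \mathbf{Q}^{-1})\mathbf{R}'\bigr).
$$

In the light of results (a) and (c), the asymptotic distribution could equivalently be described as

$$
\sqrt{T}(\mathbf{R}\hat{\boldsymbol{\pi}}_T - \mathbf{r}) \approx N\bigl(\mathbf{0}, \mathbf{R}(\hat{\boldsymbol{\Omega}}_T \otimes \mathbf{Q}_T^{-1})\mathbf{R}'\bigr),
$$

where $\hat{\boldsymbol{\Omega}}_T = (1/T)\sum_{t=1}^{T}\hat{\boldsymbol{\varepsilon}}_t\hat{\boldsymbol{\varepsilon}}_t'$ and $\mathbf{Q}_T = (1/T)\sum_{t=1}^{T}\mathbf{x}_t\mathbf{x}_t'$. Hence, the following statistic has an asymptotic χ² distribution:

$$
\begin{aligned}
\chi^2(m) &= T(\mathbf{R}\hat{\boldsymbol{\pi}}_T - \mathbf{r})'\bigl(\mathbf{R}(\hat{\boldsymbol{\Omega}}_T \otimes \mathbf{Q}_T^{-1})\mathbf{R}'\bigr)^{-1}(\mathbf{R}\hat{\boldsymbol{\pi}}_T - \mathbf{r}) \\
&= (\mathbf{R}\hat{\boldsymbol{\pi}}_T - \mathbf{r})'\bigl(\mathbf{R}[\hat{\boldsymbol{\Omega}}_T \otimes (T\mathbf{Q}_T)^{-1}]\mathbf{R}'\bigr)^{-1}(\mathbf{R}\hat{\boldsymbol{\pi}}_T - \mathbf{r}) \\
&= (\mathbf{R}\hat{\boldsymbol{\pi}}_T - \mathbf{r})'\left\{\mathbf{R}\left[\hat{\boldsymbol{\Omega}}_T \otimes \left(\sum_{t=1}^{T}\mathbf{x}_t\mathbf{x}_t'\right)^{-1}\right]\mathbf{R}'\right\}^{-1}(\mathbf{R}\hat{\boldsymbol{\pi}}_T - \mathbf{r}).
\end{aligned} \tag*{[11.1.38]}
$$

The degrees of freedom for this statistic are given by the number of rows of $\mathbf{R}$, or the number of restrictions tested.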
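The statistic [11.1.38] can be computed directly with a Kronecker product. The Python/NumPy sketch below is illustrative only: `wald_statistic` is a hypothetical helper, the regressors are generic random columns standing in for the lagged values in $\mathbf{x}_t$, and the hypothesis tested is equality of the two constant terms. For that single restriction, the general formula should agree with the scalar ratio of the squared difference in constants to $(\hat{\sigma}_1^2 - 2\hat{\sigma}_{12} + \hat{\sigma}_2^2)$ times the (1, 1) element of $(\sum\mathbf{x}_t\mathbf{x}_t')^{-1}$.

```python
import numpy as np

def wald_statistic(R, r, pi_hat, Omega_hat, XtX):
    """(R pi - r)' {R [Omega kron (X'X)^{-1}] R'}^{-1} (R pi - r), as in [11.1.38]."""
    V = np.kron(Omega_hat, np.linalg.inv(XtX))
    diff = R @ pi_hat - r
    return float(diff @ np.linalg.solve(R @ V @ R.T, diff))

# Stand-in data: n = 2 equations, k = 3 regressors including the constant.
rng = np.random.default_rng(3)
T, n, k = 120, 2, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
Pi_true = np.array([[1.0, 1.0], [0.4, -0.2], [0.1, 0.3]])   # (k x n), hypothetical
Y = X @ Pi_true + rng.normal(size=(T, n))

XtX = X.T @ X
Pi_hat = np.linalg.solve(XtX, X.T @ Y)      # column i holds pi_hat_i
resid = Y - X @ Pi_hat
Omega_hat = resid.T @ resid / T
pi_hat = Pi_hat.reshape(-1, order="F")      # vec(Pi_hat), stacked by equation

# H0: constant of equation 1 equals constant of equation 2.
R_n = np.array([[1.0, -1.0]])               # selects the equations involved
R_k = np.array([[1.0, 0.0, 0.0]])           # selects the constant term
R = np.kron(R_n, R_k)                       # (1 x nk)
stat = wald_statistic(R, np.zeros(1), pi_hat, Omega_hat, XtX)

# Scalar shortcut for this particular restriction.
xi_11 = np.linalg.inv(XtX)[0, 0]
shortcut = (Pi_hat[0, 0] - Pi_hat[0, 1]) ** 2 / (
    (Omega_hat[0, 0] - 2 * Omega_hat[0, 1] + Omega_hat[1, 1]) * xi_11)
```

Writing $\mathbf{R}$ as a Kronecker product of an equation selector and a coefficient selector keeps the restriction matrix easy to build for larger systems.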

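For example, suppose we wanted to test the hypothesis that the constant term in the first equation in the VAR ($c_1$) is equal to the constant term in the second equation ($c_2$). Then $\mathbf{R}$ is a (1 × nk) vector with unity in the first position, −1 in the (k + 1)th position, and zeros elsewhere:

$$
\mathbf{R} = [1 \;\; 0 \;\; 0 \;\; \cdots \;\; 0 \;\; -1 \;\; 0 \;\; 0 \;\; \cdots \;\; 0].
$$

To apply result [11.1.38], it is convenient to write $\mathbf{R}$ in Kronecker product form as

$$
\mathbf{R} = \mathbf{R}_n \otimes \mathbf{R}_k, \tag*{[11.1.39]}
$$

where $\mathbf{R}_n$ selects the equations that are involved and $\mathbf{R}_k$ selects the coefficients. For this example,

$$
\begin{aligned}
\underset{(1 \times n)}{\mathbf{R}_n} &= [1 \;\; -1 \;\; 0 \;\; 0 \;\; \cdots \;\; 0] \\
\underset{(1 \times k)}{\mathbf{R}_k} &= [1 \;\; 0 \;\; 0 \;\; 0 \;\; \cdots \;\; 0].
\end{aligned}
$$

We then calculate

$$
\begin{aligned}
\mathbf{R}\left[\hat{\boldsymbol{\Omega}} \otimes \left(\sum_{t=1}^{T}\mathbf{x}_t\mathbf{x}_t'\right)^{-1}\right]\mathbf{R}'
&= (\mathbf{R}_n \otimes \mathbf{R}_k)\left[\hat{\boldsymbol{\Omega}} \otimes \left(\sum_{t=1}^{T}\mathbf{x}_t\mathbf{x}_t'\right)^{-1}\right](\mathbf{R}_n' \otimes \mathbf{R}_k') \\
&= (\mathbf{R}_n\hat{\boldsymbol{\Omega}}\mathbf{R}_n') \otimes \left[\mathbf{R}_k\left(\sum_{t=1}^{T}\mathbf{x}_t\mathbf{x}_t'\right)^{-1}\mathbf{R}_k'\right] \\
&= (\hat{\sigma}_1^2 - 2\hat{\sigma}_{12} + \hat{\sigma}_2^2) \otimes \xi^{11},
\end{aligned}
$$

where $\hat{\sigma}_{12}$ is the covariance between $\hat{\varepsilon}_{1t}$ and $\hat{\varepsilon}_{2t}$ and $\xi^{11}$ is the (1, 1) element of $(\sum_{t=1}^{T}\mathbf{x}_t\mathbf{x}_t')^{-1}$. Since $\xi^{11}$ is a scalar, the foregoing Kronecker product is a simple multiplication. The test statistic [11.1.38] is then

$$
\chi^2(1) = \frac{(\hat{c}_1 - \hat{c}_2)^2}{(\hat{\sigma}_1^2 - 2\hat{\sigma}_{12} + \hat{\sigma}_2^2)\,\xi^{11}}.
$$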


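Asymptotic Distribution of Ω̂


In considering the asymptotic distribution of the estimates of variances and covariances, notice that since $\boldsymbol{\Omega}$ is symmetric, some of its elements are redundant. Recall that the “vec” operator transforms an (n × n) matrix into an (n² × 1) vector by stacking the columns. For example,

$$
\operatorname{vec}\begin{bmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_{22} & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_{33} \end{bmatrix}
= \begin{bmatrix} \sigma_{11} \\ \sigma_{21} \\ \sigma_{31} \\ \sigma_{12} \\ \sigma_{22} \\ \sigma_{32} \\ \sigma_{13} \\ \sigma_{23} \\ \sigma_{33} \end{bmatrix}. \tag*{[11.1.40]}
$$

An analogous “vech” operator transforms an (n × n) matrix into an ([n(n + 1)/2] × 1) vector by vertically stacking those elements on or below the principal diagonal. For example,

$$
\operatorname{vech}\begin{bmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_{22} & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_{33} \end{bmatrix}
= \begin{bmatrix} \sigma_{11} \\ \sigma_{21} \\ \sigma_{31} \\ \sigma_{22} \\ \sigma_{32} \\ \sigma_{33} \end{bmatrix}. \tag*{[11.1.41]}
$$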

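Proposition 11.2: Let

$$
\mathbf{y}_t = \mathbf{c} + \boldsymbol{\Phi}_1\mathbf{y}_{t-1} + \boldsymbol{\Phi}_2\mathbf{y}_{t-2} + \cdots + \boldsymbol{\Phi}_p\mathbf{y}_{t-p} + \boldsymbol{\varepsilon}_t,
$$

where $\boldsymbol{\varepsilon}_t \sim \text{i.i.d. } N(\mathbf{0}, \boldsymbol{\Omega})$ and where roots of

$$
|\mathbf{I}_n - \boldsymbol{\Phi}_1 z - \boldsymbol{\Phi}_2 z^2 - \cdots - \boldsymbol{\Phi}_p z^p| = 0
$$

lie outside the unit circle. Let $\hat{\boldsymbol{\pi}}_T$, $\hat{\boldsymbol{\Omega}}_T$, and $\mathbf{Q}$ be as defined in Proposition 11.1. Then

$$
\begin{bmatrix} \sqrt{T}[\hat{\boldsymbol{\pi}}_T - \boldsymbol{\pi}] \\ \sqrt{T}[\operatorname{vech}(\hat{\boldsymbol{\Omega}}_T) - \operatorname{vech}(\boldsymbol{\Omega})] \end{bmatrix}
\xrightarrow{L} N\left(\begin{bmatrix} \mathbf{0} \\ \mathbf{0} \end{bmatrix}, \begin{bmatrix} (\boldsymbol{\Omega} \otimes \mathbf{Q}^{-1}) & \mathbf{0} \\ \mathbf{0} & \boldsymbol{\Sigma}_{22} \end{bmatrix}\right).
$$

Let $\sigma_{ij}$ denote the row i, column j element of $\boldsymbol{\Omega}$; for example, $\sigma_{11}$ is the variance of $\varepsilon_{1t}$. Then the element of $\boldsymbol{\Sigma}_{22}$ corresponding to the covariance between $\hat{\sigma}_{ij}$ and $\hat{\sigma}_{lm}$ is given by $(\sigma_{il}\sigma_{jm} + \sigma_{im}\sigma_{jl})$ for all i, j, l, m = 1, 2, ..., n, including i = j = l = m.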

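For example, for n = 2, Proposition 11.2 implies that

$$
\sqrt{T}\begin{bmatrix} \hat{\sigma}_{11,T} - \sigma_{11} \\ \hat{\sigma}_{12,T} - \sigma_{12} \\ \hat{\sigma}_{22,T} - \sigma_{22} \end{bmatrix}
\xrightarrow{L} N\left(\begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix},
\begin{bmatrix}
2\sigma_{11}^2 & 2\sigma_{11}\sigma_{12} & 2\sigma_{12}^2 \\
2\sigma_{11}\sigma_{12} & \sigma_{11}\sigma_{22} + \sigma_{12}^2 & 2\sigma_{12}\sigma_{22} \\
2\sigma_{12}^2 & 2\sigma_{12}\sigma_{22} & 2\sigma_{22}^2
\end{bmatrix}\right). \tag*{[11.1.42]}
$$

Thus, a Wald test of the null hypothesis that there is no covariance between $\varepsilon_{1t}$ and $\varepsilon_{2t}$ is given by

$$
\frac{\sqrt{T}\,\hat{\sigma}_{12}}{(\hat{\sigma}_{11}\hat{\sigma}_{22} + \hat{\sigma}_{12}^2)^{1/2}} \approx N(0, 1).
$$

A Wald test of the null hypothesis that $\varepsilon_{1t}$ and $\varepsilon_{2t}$ have the same variance is given by

$$
\frac{T(\hat{\sigma}_{11} - \hat{\sigma}_{22})^2}{2\hat{\sigma}_{11}^2 - 4\hat{\sigma}_{12}^2 + 2\hat{\sigma}_{22}^2} \approx \chi^2(1),
$$

where $\hat{\sigma}_{11}^2$ denotes the square of the estimated variance of the innovation for the first equation.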

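The matrix $\boldsymbol{\Sigma}_{22}$ in Proposition 11.2 can be expressed more compactly using the duplication matrix. Notice that since $\boldsymbol{\Omega}$ is symmetric, the n² elements of vec($\boldsymbol{\Omega}$) in [11.1.40] are simple duplications of the n(n + 1)/2 elements of vech($\boldsymbol{\Omega}$) in [11.1.41]. There exists a unique [n² × n(n + 1)/2] matrix $\mathbf{D}_n$ that transforms vech($\boldsymbol{\Omega}$) into vec($\boldsymbol{\Omega}$), that is, a unique matrix satisfying

$$
\mathbf{D}_n\operatorname{vech}(\boldsymbol{\Omega}) = \operatorname{vec}(\boldsymbol{\Omega}). \tag*{[11.1.43]}
$$

For example, for n = 2, equation [11.1.43] is

$$
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} \sigma_{11} \\ \sigma_{21} \\ \sigma_{22} \end{bmatrix}
= \begin{bmatrix} \sigma_{11} \\ \sigma_{21} \\ \sigma_{12} \\ \sigma_{22} \end{bmatrix}. \tag*{[11.1.44]}
$$

Further, define $\mathbf{D}_n^+$ to be the following [n(n + 1)/2 × n²] matrix:²

$$
\mathbf{D}_n^+ \equiv (\mathbf{D}_n'\mathbf{D}_n)^{-1}\mathbf{D}_n'. \tag*{[11.1.45]}
$$

Notice that $\mathbf{D}_n^+\mathbf{D}_n = \mathbf{I}_{n(n+1)/2}$. Thus, premultiplying both sides of [11.1.43] by $\mathbf{D}_n^+$ reveals $\mathbf{D}_n^+$ to be a matrix that transforms vec($\boldsymbol{\Omega}$) into vech($\boldsymbol{\Omega}$) for symmetric $\boldsymbol{\Omega}$:

$$
\operatorname{vech}(\boldsymbol{\Omega}) = \mathbf{D}_n^+\operatorname{vec}(\boldsymbol{\Omega}). \tag*{[11.1.46]}
$$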
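For example, for n = 2, equation [11.1.46] is

$$
\begin{bmatrix} \sigma_{11} \\ \sigma_{21} \\ \sigma_{22} \end{bmatrix}
= \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & \tfrac{1}{2} & \tfrac{1}{2} & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} \sigma_{11} \\ \sigma_{21} \\ \sigma_{12} \\ \sigma_{22} \end{bmatrix}. \tag*{[11.1.47]}
$$

It turns out that the matrix $\boldsymbol{\Sigma}_{22}$ described in Proposition 11.2 can be written as³

$$
\boldsymbol{\Sigma}_{22} = 2\mathbf{D}_n^+(\boldsymbol{\Omega} \otimes \boldsymbol{\Omega})(\mathbf{D}_n^+)'. \tag*{[11.1.48]}
$$

For example, for n = 2, expression [11.1.48] becomes

$$
\begin{aligned}
2\mathbf{D}_2^+(\boldsymbol{\Omega} \otimes \boldsymbol{\Omega})(\mathbf{D}_2^+)'
&= 2\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & \tfrac{1}{2} & \tfrac{1}{2} & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \\
&\quad \times \begin{bmatrix}
\sigma_{11}\sigma_{11} & \sigma_{11}\sigma_{12} & \sigma_{12}\sigma_{11} & \sigma_{12}\sigma_{12} \\
\sigma_{11}\sigma_{21} & \sigma_{11}\sigma_{22} & \sigma_{12}\sigma_{21} & \sigma_{12}\sigma_{22} \\
\sigma_{21}\sigma_{11} & \sigma_{21}\sigma_{12} & \sigma_{22}\sigma_{11} & \sigma_{22}\sigma_{12} \\
\sigma_{21}\sigma_{21} & \sigma_{21}\sigma_{22} & \sigma_{22}\sigma_{21} & \sigma_{22}\sigma_{22}
\end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 \\ 0 & \tfrac{1}{2} & 0 \\ 0 & \tfrac{1}{2} & 0 \\ 0 & 0 & 1 \end{bmatrix} \\
&= \begin{bmatrix}
2\sigma_{11}^2 & 2\sigma_{11}\sigma_{12} & 2\sigma_{12}^2 \\
2\sigma_{11}\sigma_{12} & \sigma_{11}\sigma_{22} + \sigma_{12}^2 & 2\sigma_{12}\sigma_{22} \\
2\sigma_{12}^2 & 2\sigma_{12}\sigma_{22} & 2\sigma_{22}^2
\end{bmatrix},
\end{aligned}
$$

which reproduces [11.1.42].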
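The duplication matrix and expression [11.1.48] are easy to build numerically. The following Python/NumPy sketch is an illustration added here, not code from the original text; the helper names `vec`, `vech`, and `duplication_matrix` and the numerical values in `Omega` are assumptions for the example.

```python
import numpy as np

def vec(M):
    """Stack the columns of M."""
    return M.reshape(-1, order="F")

def vech(M):
    """Stack the elements on or below the principal diagonal, column by column."""
    n = M.shape[0]
    return np.concatenate([M[j:, j] for j in range(n)])

def duplication_matrix(n):
    """Return D_n such that D_n vech(S) = vec(S) for symmetric (n x n) S."""
    D = np.zeros((n * n, n * (n + 1) // 2))
    col = 0
    for j in range(n):
        for i in range(j, n):
            D[j * n + i, col] = 1.0   # position of element (i, j) in vec(S)
            D[i * n + j, col] = 1.0   # position of element (j, i) in vec(S)
            col += 1
    return D

Omega = np.array([[2.0, 1.0], [1.0, 2.5]])    # hypothetical symmetric matrix
D2 = duplication_matrix(2)
D2_plus = np.linalg.solve(D2.T @ D2, D2.T)    # (D'D)^{-1} D', as in [11.1.45]

Sigma22 = 2.0 * D2_plus @ np.kron(Omega, Omega) @ D2_plus.T   # [11.1.48]

s11, s12, s22 = Omega[0, 0], Omega[0, 1], Omega[1, 1]
expected = np.array([                         # right side of [11.1.42]
    [2 * s11**2,    2 * s11 * s12,      2 * s12**2],
    [2 * s11 * s12, s11 * s22 + s12**2, 2 * s12 * s22],
    [2 * s12**2,    2 * s12 * s22,      2 * s22**2],
])
```

The same `duplication_matrix` function works for any n, so the check can be repeated for larger systems by comparing entries of `Sigma22` with $(\sigma_{il}\sigma_{jm} + \sigma_{im}\sigma_{jl})$.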


11.2. Bivariate Granger Causality Tests 


One of the key questions that can be addressed with vector autoregressions is how 
useful some variables are for forecasting others. This section discusses a particular 
summary of the forecasting relation between two variables proposed by Granger 
(1969) and popularized by Sims (1972). A more general discussion of a related 
question in larger vector systems is provided in the following section. 


²It can be shown that $(\mathbf{D}_n'\mathbf{D}_n)$ is nonsingular. For more details, see Magnus and Neudecker (1988, pp. 48-49).
³Magnus and Neudecker (1988, p. 318) derived this expression directly from the information matrix.




Definition of Bivariate Granger Causality 


The question investigated in this section is whether a scalar y can help forecast another scalar x. If it cannot, then we say that y does not Granger-cause x. More formally, y fails to Granger-cause x if for all s > 0 the mean squared error of a forecast of $x_{t+s}$ based on $(x_t, x_{t-1}, \ldots)$ is the same as the MSE of a forecast of $x_{t+s}$ that uses both $(x_t, x_{t-1}, \ldots)$ and $(y_t, y_{t-1}, \ldots)$. If we restrict ourselves to linear functions, y fails to Granger-cause x if

$$
\text{MSE}[\hat{E}(x_{t+s} \mid x_t, x_{t-1}, \ldots)] = \text{MSE}[\hat{E}(x_{t+s} \mid x_t, x_{t-1}, \ldots, y_t, y_{t-1}, \ldots)]. \tag*{[11.2.1]}
$$

Equivalently, we say that x is exogenous in the time series sense with respect to y if [11.2.1] holds. Yet a third expression meaning the same thing is that y is not linearly informative about future x.

Granger’s reason for proposing this definition was that if an event Y is the 
cause of another event X, then the event Y should precede the event X. Although 
one might agree with this position philosophically, there can be serious obstacles 
to practical implementation of this idea using aggregate time series data, as will 
be seen in the examples considered later in this section. First, however, we explore 
the mechanical implications of Granger causality for the time series representation 
of a bivariate system. 


Alternative Implications of Granger Causality 


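In a bivariate VAR describing x and y, y does not Granger-cause x if the coefficient matrices $\boldsymbol{\Phi}_j$ are lower triangular for all j:

$$
\begin{aligned}
\begin{bmatrix} x_t \\ y_t \end{bmatrix}
&= \begin{bmatrix} c_1 \\ c_2 \end{bmatrix}
+ \begin{bmatrix} \phi_{11}^{(1)} & 0 \\ \phi_{21}^{(1)} & \phi_{22}^{(1)} \end{bmatrix}\begin{bmatrix} x_{t-1} \\ y_{t-1} \end{bmatrix}
+ \begin{bmatrix} \phi_{11}^{(2)} & 0 \\ \phi_{21}^{(2)} & \phi_{22}^{(2)} \end{bmatrix}\begin{bmatrix} x_{t-2} \\ y_{t-2} \end{bmatrix} + \cdots \\
&\quad + \begin{bmatrix} \phi_{11}^{(p)} & 0 \\ \phi_{21}^{(p)} & \phi_{22}^{(p)} \end{bmatrix}\begin{bmatrix} x_{t-p} \\ y_{t-p} \end{bmatrix}
+ \begin{bmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{bmatrix}.
\end{aligned} \tag*{[11.2.2]}
$$

From the first row of this system, the optimal one-period-ahead forecast of x depends only on its own lagged values and not on lagged y:

$$
\hat{E}(x_{t+1} \mid x_t, x_{t-1}, \ldots, y_t, y_{t-1}, \ldots)
= c_1 + \phi_{11}^{(1)}x_t + \phi_{11}^{(2)}x_{t-1} + \cdots + \phi_{11}^{(p)}x_{t-p+1}. \tag*{[11.2.3]}
$$

Furthermore, the value of $x_{t+2}$ from [11.2.2] is given by

$$
x_{t+2} = c_1 + \phi_{11}^{(1)}x_{t+1} + \phi_{11}^{(2)}x_t + \cdots + \phi_{11}^{(p)}x_{t-p+2} + \varepsilon_{1,t+2}.
$$

Recalling [11.2.3] and the law of iterated projections, it is clear that the date t forecast of this magnitude on the basis of $(x_t, x_{t-1}, \ldots, y_t, y_{t-1}, \ldots)$ also depends only on $(x_t, x_{t-1}, \ldots, x_{t-p+1})$. By induction, the same is true of an s-period-ahead forecast. Thus, for the bivariate VAR, y does not Granger-cause x if $\boldsymbol{\Phi}_j$ is lower triangular for all j, as claimed.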

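Recall from equation [10.1.19] that

$$
\boldsymbol{\Psi}_s = \boldsymbol{\Phi}_1\boldsymbol{\Psi}_{s-1} + \boldsymbol{\Phi}_2\boldsymbol{\Psi}_{s-2} + \cdots + \boldsymbol{\Phi}_p\boldsymbol{\Psi}_{s-p} \quad \text{for } s = 1, 2, \ldots,
$$

with $\boldsymbol{\Psi}_0$ the identity matrix and $\boldsymbol{\Psi}_s = \mathbf{0}$ for s < 0. This expression implies that if $\boldsymbol{\Phi}_j$ is lower triangular for all j, then the moving average matrices $\boldsymbol{\Psi}_s$ for the fundamental representation will be lower triangular for all s. Thus, if y fails to Granger-cause x, then the MA(∞) representation can be written

$$
\begin{bmatrix} x_t \\ y_t \end{bmatrix}
= \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}
+ \begin{bmatrix} \psi_{11}(L) & 0 \\ \psi_{21}(L) & \psi_{22}(L) \end{bmatrix}\begin{bmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{bmatrix}, \tag*{[11.2.4]}
$$

where

$$
\psi_{ij}(L) = \psi_{ij}^{(0)} + \psi_{ij}^{(1)}L^1 + \psi_{ij}^{(2)}L^2 + \psi_{ij}^{(3)}L^3 + \cdots
$$

with $\psi_{11}^{(0)} = \psi_{22}^{(0)} = 1$ and $\psi_{21}^{(0)} = 0$.

Another implication of Granger causality was stressed by Sims (1972).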
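The claim that lower triangular autoregressive matrices produce lower triangular moving average matrices can be seen by running the recursion from [10.1.19]. The following is a small Python/NumPy sketch added for illustration; `ma_matrices` is a hypothetical helper and the coefficient values are arbitrary.

```python
import numpy as np

def ma_matrices(Phi, horizon):
    """Return [Psi_0, ..., Psi_horizon] from Psi_s = sum_j Phi_j Psi_{s-j}.

    Phi is a list [Phi_1, ..., Phi_p]; Psi_0 = I and Psi_s = 0 for s < 0.
    """
    n = Phi[0].shape[0]
    Psi = [np.eye(n)]
    for s in range(1, horizon + 1):
        total = np.zeros((n, n))
        for j in range(1, len(Phi) + 1):
            if s - j >= 0:
                total += Phi[j - 1] @ Psi[s - j]
        Psi.append(total)
    return Psi

# Hypothetical bivariate VAR(2) in which y does not Granger-cause x:
# the (1, 2) element of every Phi_j is zero.
Phi = [np.array([[0.5, 0.0], [0.3, 0.2]]),
       np.array([[-0.1, 0.0], [0.4, 0.1]])]
Psi = ma_matrices(Phi, horizon=10)

upper_right = np.array([P[0, 1] for P in Psi])   # psi_12^{(s)} for s = 0, ..., 10
```

Every entry of `upper_right` is zero, so shocks to the second equation never enter the moving average representation of x.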


Proposition 11.3: Consider a linear projection of $y_t$ on past, present, and future x’s,

$$
y_t = c + \sum_{j=0}^{\infty}b_j x_{t-j} + \sum_{j=1}^{\infty}d_j x_{t+j} + \eta_t, \tag*{[11.2.5]}
$$

where $b_j$ and $d_j$ are defined as population projection coefficients, that is, the values for which

$$
E(\eta_t x_\tau) = 0 \quad \text{for all } t \text{ and } \tau.
$$

Then y fails to Granger-cause x if and only if $d_j = 0$ for j = 1, 2, ....


Econometric Tests for Granger Causality 


Econometric tests of whether a particular observed series y Granger-causes x can be based on any of the three implications [11.2.2], [11.2.4], or [11.2.5]. The simplest and probably best approach uses the autoregressive specification [11.2.2]. To implement this test, we assume a particular autoregressive lag length p and estimate

$$
\begin{aligned}
x_t &= c_1 + \alpha_1 x_{t-1} + \alpha_2 x_{t-2} + \cdots + \alpha_p x_{t-p} + \beta_1 y_{t-1} \\
&\quad + \beta_2 y_{t-2} + \cdots + \beta_p y_{t-p} + u_t
\end{aligned} \tag*{[11.2.6]}
$$

by OLS. We then conduct an F test of the null hypothesis

$$
H_0\colon \beta_1 = \beta_2 = \cdots = \beta_p = 0. \tag*{[11.2.7]}
$$

Recalling Proposition 8.2, one way to implement this test is to calculate the sum of squared residuals from [11.2.6],⁴

$$
RSS_1 = \sum_{t=1}^{T}\hat{u}_t^2,
$$

and compare this with the sum of squared residuals of a univariate autoregression for $x_t$,

$$
RSS_0 = \sum_{t=1}^{T}\hat{e}_t^2,
$$

where

$$
x_t = c_0 + \gamma_1 x_{t-1} + \gamma_2 x_{t-2} + \cdots + \gamma_p x_{t-p} + e_t \tag*{[11.2.8]}
$$

is also estimated by OLS. If

$$
S_1 \equiv \frac{(RSS_0 - RSS_1)/p}{RSS_1/(T - 2p - 1)} \tag*{[11.2.9]}
$$

is greater than the 5% critical value for an F(p, T − 2p − 1) distribution, then we reject the null hypothesis that y does not Granger-cause x; that is, if $S_1$ is sufficiently large, we conclude that y does Granger-cause x.

The test statistic [11.2.9] would have an exact F distribution for a regression with fixed regressors and Gaussian disturbances. With lagged dependent variables as in the Granger-causality regressions, however, the test is valid only asymptotically. An asymptotically equivalent test is given by

$$
S_2 \equiv \frac{T(RSS_0 - RSS_1)}{RSS_1}. \tag*{[11.2.10]}
$$

We would reject the null hypothesis that y does not Granger-cause x if $S_2$ is greater than the 5% critical value for a χ²(p) variable.

⁴Note that in order for t to run from 1 to T as indicated, we actually need T + p observations on x and y, namely, $x_{-p+1}, x_{-p+2}, \ldots, x_T$ and $y_{-p+1}, y_{-p+2}, \ldots, y_T$.
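The two regressions and the statistics [11.2.9] and [11.2.10] can be sketched as follows. This is an illustrative Python/NumPy sketch, not the author's code: the function names are hypothetical, the data are simulated from an arbitrary process in which lagged y enters the x equation, and 5.99 is the standard tabulated 5% critical value for a χ²(2) variable.

```python
import numpy as np

def lagged_design(x, y, p):
    """Return x_t, the regressors (1, x_{t-1..t-p}), and the regressors y_{t-1..t-p}."""
    total = len(x)
    T = total - p
    own = np.column_stack([np.ones(T)] + [x[p - j : total - j] for j in range(1, p + 1)])
    cross = np.column_stack([y[p - j : total - j] for j in range(1, p + 1)])
    return x[p:], own, cross

def rss(target, X):
    """Sum of squared OLS residuals from regressing target on X."""
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    e = target - X @ beta
    return float(e @ e)

def granger_statistics(x, y, p):
    """Return (S1, S2) for the null hypothesis that y does not Granger-cause x."""
    target, own, cross = lagged_design(x, y, p)
    T = len(target)
    rss0 = rss(target, own)                              # restricted, [11.2.8]
    rss1 = rss(target, np.column_stack([own, cross]))    # unrestricted, [11.2.6]
    s1 = ((rss0 - rss1) / p) / (rss1 / (T - 2 * p - 1))  # [11.2.9]
    s2 = T * (rss0 - rss1) / rss1                        # [11.2.10]
    return s1, s2

# Simulated data in which lagged y helps to predict x but not the reverse.
rng = np.random.default_rng(4)
N = 502
x = np.zeros(N)
y = np.zeros(N)
for t in range(1, N):
    y[t] = 0.3 * y[t - 1] + rng.normal()
    x[t] = 0.5 * x[t - 1] + 0.8 * y[t - 1] + rng.normal()

p = 2
s1_yx, s2_yx = granger_statistics(x, y, p)   # does y Granger-cause x?
s1_xy, s2_xy = granger_statistics(y, x, p)   # does x Granger-cause y?
chi2_crit_5pct = 5.99                        # chi-square(2), tabulated value
```

The first p observations serve as initial conditions, so T = 500 here. The two statistics are linked by $S_2 = TpS_1/(T - 2p - 1)$, which is one way to see that they are asymptotically equivalent.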

An alternative approach is to base the test on the Sims form [11.2.5] instead of the Granger form [11.2.2]. A problem with the Sims form is that the error term $\eta_t$ is in general autocorrelated. Thus, a standard F test of the hypothesis that $d_j = 0$ for all j in [11.2.5] will not give the correct answer. One option is to use autocorrelation-consistent standard errors for the OLS estimates as described in Section 10.5. A second option is to use a generalized least squares transformation. A third option, suggested by Geweke, Meese, and Dent (1983), is as follows. Suppose the error term $\eta_t$ in [11.2.5] has Wold representation $\eta_t = \psi_{22}(L)v_{2t}$. Multiplying both sides of [11.2.5] by $h(L) \equiv [\psi_{22}(L)]^{-1}$ produces

$$
y_t = c_2 - \sum_{j=1}^{\infty}h_j y_{t-j} + \sum_{j=0}^{\infty}b_j^* x_{t-j} + \sum_{j=1}^{\infty}d_j^* x_{t+j} + v_{2t}. \tag*{[11.2.11]}
$$

The error term in [11.2.11] is white noise and uncorrelated with any of the explanatory variables. Moreover, $d_j^* = 0$ for all j if and only if $d_j = 0$ for all j. Thus, by truncating the infinite sums in [11.2.11] at some finite value, we can test the null hypothesis that y does not Granger-cause x with an F test of $d_1^* = d_2^* = \cdots = d_p^* = 0$.

A variety of other Granger-causality tests have been proposed; see Pierce 
and Haugh (1977) and Geweke, Meese, and Dent (1983) for selective surveys. 
Bouissou, Laffont, and Vuong (1986) discussed tests using discrete-valued panel 
data. The Monte Carlo simulations of Geweke, Meese, and Dent suggest that the 
simplest and most straightforward test—namely, that based on [11.2.10]—may 
well be the best. 

The results of any empirical test for Granger causality can be surprisingly 
sensitive to the choice of lag length (p) or the methods used to deal with potential 
nonstationarity of the series. For demonstrations of the practical relevance of such 
issues, see Feige and Pearce (1979), Christiano and Ljungqvist (1988), and Stock 
and Watson (1989). 


Interpreting Granger-Causality Tests 


How is “Granger causality” related to the standard meaning of “causality”? 
We explore this question with several examples. 




Example 11.1—Granger-Causality Tests 
and Forward-Looking Behavior 


The first example uses a modification of the model of stock prices described 
in Chapter 2. If an investor buys one share of a stock for the price P, at date 
t, then at ¢ + 1 the investor will receive D,,, in dividends and be able to sell 
the stock for P,,,. The ex post rate of return from the stock (denoted r,, |) is 
defined by 


(1 + ra )P, = Pray + Dest [11.2.12] 


A simple model of stock prices holds that the expected rate of return for the
stock is a constant $r$ at all dates:*

$$(1 + r)P_t = E_t[P_{t+1} + D_{t+1}].$$ [11.2.13]

Here $E_t$ denotes an expectation conditional on all information available to
stock market participants at time $t$. The logic behind [11.2.13] is that if investors
had information at time $t$ leading them to anticipate a higher-than-normal return
to stocks, they would want to buy more stocks at date $t$. Such purchases would
drive $P_t$ up until [11.2.13] was satisfied. This view is sometimes called the
efficient markets hypothesis.

As noted in the discussion of equation [2.5.15] in Chapter 2, equation
[11.2.13] along with a boundedness condition implies

$$P_t = E_t \sum_{j=1}^{\infty} \left[\frac{1}{1+r}\right]^{j} D_{t+j}.$$ [11.2.14]

Thus, according to the theory, the stock price incorporates the market's best
forecast of the present value of future dividends. If this forecast is based on
more information than past dividends alone, then stock prices will Granger-
cause dividends as investors try to anticipate movements in dividends.

For a simple illustration of this point, suppose that

$$D_t = d + u_t + \delta u_{t-1} + v_t,$$ [11.2.15]

where $u_t$ and $v_t$ are independent Gaussian white noise series and $d$ is the mean
dividend. Suppose that investors at time $t$ know the values of $\{u_t, u_{t-1}, \ldots\}$
and $\{v_t, v_{t-1}, \ldots\}$. The forecast of $D_{t+j}$ based on this information is given by

$$E_t(D_{t+j}) = \begin{cases} d + \delta u_t & \text{for } j = 1 \\ d & \text{for } j = 2, 3, \ldots \end{cases}$$ [11.2.16]

Substituting [11.2.16] into [11.2.14], the stock price would be given by

$$P_t = d/r + \delta u_t/(1 + r).$$ [11.2.17]

*A related model was proposed by Lucas (1978):

$$U'(C_t)P_t = E_t\{\beta U'(C_{t+1})(P_{t+1} + D_{t+1})\},$$

with $U'(C_t)$ the marginal utility of consumption at date $t$. If we define $\tilde{P}_t$ to be the marginal-utility-
weighted stock price $\tilde{P}_t = U'(C_t)P_t$ and $\tilde{D}_t$ the marginal-utility-weighted dividend, then this becomes

$$\beta^{-1}\tilde{P}_t = E_t\{\tilde{P}_{t+1} + \tilde{D}_{t+1}\},$$

which is the same basic form as [11.2.13]. With risk-neutral investors, $U'(C_t)$ is a constant and the two
formulations are identical. The risk-neutral version gained early support from the empirical evidence
in Fama (1965).




Thus, for this example, the stock price is white noise and could not be forecast 
on the basis of lagged stock prices or dividends.* No series should Granger- 
cause stock prices. 

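On the other hand, notice from [11.2.17] that the value of $u_{t-1}$ can be
uncovered from the lagged stock price:

$$\delta u_{t-1} = (1 + r)P_{t-1} - (1 + r)d/r.$$

Recall from Section 4.7 that $u_{t-1}$ contains additional information about $D_t$
beyond that contained in $\{D_{t-1}, D_{t-2}, \ldots\}$. Thus, stock prices Granger-cause
dividends, though dividends fail to Granger-cause stock prices. The bivariate
VAR takes the form

$$\begin{bmatrix} P_t \\ D_t \end{bmatrix} = \begin{bmatrix} d/r \\ -d/r \end{bmatrix} + \begin{bmatrix} 0 & 0 \\ (1+r) & 0 \end{bmatrix} \begin{bmatrix} P_{t-1} \\ D_{t-1} \end{bmatrix} + \begin{bmatrix} \delta u_t/(1+r) \\ u_t + v_t \end{bmatrix}.$$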


Hence, in this model, Granger causation runs in the opposite direction
from the true causation. Dividends fail to "Granger-cause" prices, even though
investors' perceptions of dividends are the sole determinant of stock prices.
On the other hand, prices do "Granger-cause" dividends, even though the
market's evaluation of the stock in reality has no effect on the dividend process.
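The asymmetry in this example can be checked with a small simulation. The sketch below is illustrative only: the parameter values (`d`, `r`, `delta`, `T`), the one-lag specification, and the helper names `ols_rss` and `granger_f` are assumptions made here, not part of the text. It generates dividends from [11.2.15] and prices from [11.2.17], then computes the F statistic for excluding the other variable's lag in each direction.

```python
import numpy as np

def ols_rss(y, X):
    """Residual sum of squares from an OLS regression of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

def granger_f(target, own_lag, other_lag):
    """F statistic for excluding one lag of the other variable from a
    regression of target on a constant, its own lag, and that lag."""
    T = target.shape[0]
    const = np.ones(T)
    X_r = np.column_stack([const, own_lag])              # restricted
    X_u = np.column_stack([const, own_lag, other_lag])   # unrestricted
    rss_r, rss_u = ols_rss(target, X_r), ols_rss(target, X_u)
    return (rss_r - rss_u) / (rss_u / (T - X_u.shape[1]))

rng = np.random.default_rng(0)
T, d, r, delta = 2000, 1.0, 0.05, 0.8    # hypothetical values
u = rng.standard_normal(T + 1)
v = rng.standard_normal(T + 1)
D = d + u[1:] + delta * u[:-1] + v[1:]   # dividends, [11.2.15]
P = d / r + delta * u[1:] / (1 + r)      # prices, [11.2.17]

f_p_to_d = granger_f(D[1:], D[:-1], P[:-1])   # do prices help forecast dividends?
f_d_to_p = granger_f(P[1:], P[:-1], D[:-1])   # do dividends help forecast prices?
print(f"F (prices -> dividends): {f_p_to_d:.1f}")
print(f"F (dividends -> prices): {f_d_to_p:.2f}")
```

Under these assumptions the first statistic should be very large, while the second behaves like a draw from an F(1, T - 3) distribution, in line with the conclusion that prices Granger-cause dividends but not the reverse.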


In general, time series that reflect forward-looking behavior, such as stock 
prices and interest rates, are often found to be excellent predictors of many key 
economic time series. This clearly does not mean that these series cause GNP or 
inflation to move up or down. Instead, the values of these series reflect the market’s 
best information as to where GNP or inflation might be headed. Granger-causality 
tests for such series may be useful for assessing the efficient markets view or 
investigating whether markets are concerned with or are able to forecast GNP or 
inflation, but should not be used to infer a direction of causation. 

There nevertheless are circumstances in which Granger causality may offer 
useful evidence about the direction of true causation. As an illustration of this 
theme, consider trying to measure the effects of oil price increases on the economy. 


Example 11.2—Testing for Strict Econometric Exogeneity†

All but one of the economic recessions in the United States since World War
II have been preceded by a sharp increase in the price of crude petroleum.
Does this mean that oil shocks are a cause of recessions? 

One possibility is that the correlation is a fluke—it happened just by 
chance that oil shocks and recessions appeared at similar times, even though 
the actual processes that generated the two series are unrelated. We can in- 
vestigate this possibility by testing the null hypothesis that oil prices do not 
Granger-cause GNP. This hypothesis is rejected by the data—oil prices help 
predict the value of GNP, and their contribution to prediction is statistically 
significant. This argues against viewing the correlation as simply a coincidence. 

To place a causal interpretation on this correlation, one must establish 
that oil price increases were not reflecting some other macroeconomic influence 
that was the true cause of the recessions. The major oil price increases have 


*This result is due to the particular specification of the time series properties assumed for dividends.
A completely general result is that the excess return series defined by $P_{t+1} + D_{t+1} - (1 + r)P_t$ (which
for this example would equal $\delta u_{t+1}/(1 + r) + u_{t+1} + v_{t+1}$) should be unforecastable. The example in
the text provides a simpler illustration of the general issues.


†This discussion is based on Hamilton (1983, 1985).




been associated with clear historical events such as the Suez crisis of 1956-57, 
the Arab-Israeli war of 1973-74, the Iranian revolution of 1978—79, the start 
of the Iran-Iraq war in 1980, and Iraq’s invasion of Kuwait in 1990. One could 
take the view that these events were caused by forces entirely outside the U.S. 
economy and were essentially unpredictable. If this view is correct, then the 
historical correlation between oil prices and GNP could be given a causal 
interpretation. The view has the refutable implication that no series should 
Granger-cause oil prices. Empirically, one indeed finds very few mac- 
roeconomic series that help predict the timing of these oil shocks. 


The theme of these two examples is that Granger-causality tests can be a 
useful tool for testing hypotheses that can be framed as statements about the 
predictability of a particular series. On the other hand, one may be skeptical about 
their utility as a general diagnostic for establishing the direction of causation be- 
tween two arbitrary series. For this reason, it seems best to describe these as tests 
of whether y helps forecast x rather than tests of whether y causes x. The tests 
may have implications for the latter question, but only in conjunction with other 
assumptions. 

Up to this point we have been discussing two variables, x and y, in isolation 
from any others. Suppose there are other variables that interact with x or y as well. 
How does this affect the forecasting relationship between x and y? 


Example 11.3—Role of Omitted Information 
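Consider the following three-variable system:

$$\begin{bmatrix} y_{1t} \\ y_{2t} \\ y_{3t} \end{bmatrix} = \begin{bmatrix} 1 + \delta L & L & 0 \\ 0 & 1 & 0 \\ 0 & L & 1 \end{bmatrix} \begin{bmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \\ \varepsilon_{3t} \end{bmatrix},$$

with

$$E(\varepsilon_t \varepsilon_s') = \begin{cases} \begin{bmatrix} \sigma_1^2 & 0 & 0 \\ 0 & \sigma_2^2 & 0 \\ 0 & 0 & \sigma_3^2 \end{bmatrix} & \text{for } t = s \\ 0 & \text{otherwise.} \end{cases}$$

Thus, $y_3$ can offer no improvement in a forecast of either $y_1$ or $y_2$ beyond that
achieved using lagged $y_1$ and $y_2$.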

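Let us now examine the bivariate Granger-causality relation between $y_1$
and $y_3$. First, consider the process for $y_1$:

$$y_{1t} = \varepsilon_{1t} + \delta\varepsilon_{1,t-1} + \varepsilon_{2,t-1}.$$

Notice that $y_1$ is the sum of an MA(1) process ($\varepsilon_{1t} + \delta\varepsilon_{1,t-1}$) and an uncorrelated
white noise process ($\varepsilon_{2,t-1}$). We know from equation [4.7.15] that the univariate
representation for $y_1$ is an MA(1) process:

$$y_{1t} = u_t + \theta u_{t-1}.$$

From [4.7.16], the univariate forecast error $u_t$ can be expressed as

$$\begin{aligned} u_t = {} & (\varepsilon_{1t} - \theta\varepsilon_{1,t-1} + \theta^2\varepsilon_{1,t-2} - \theta^3\varepsilon_{1,t-3} + \cdots) \\ & + \delta(\varepsilon_{1,t-1} - \theta\varepsilon_{1,t-2} + \theta^2\varepsilon_{1,t-3} - \theta^3\varepsilon_{1,t-4} + \cdots) \\ & + (\varepsilon_{2,t-1} - \theta\varepsilon_{2,t-2} + \theta^2\varepsilon_{2,t-3} - \theta^3\varepsilon_{2,t-4} + \cdots). \end{aligned}$$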




The univariate forecast error $u_t$ is, of course, uncorrelated with its own lagged
values. Notice, however, that it is correlated with $y_{3,t-1}$:

$$E(u_t y_{3,t-1}) = E[u_t(\varepsilon_{3,t-1} + \varepsilon_{2,t-2})] = -\theta\sigma_2^2.$$

Thus, lagged $y_3$ could help improve a forecast of $y_1$ that had been based
on lagged values of $y_1$ alone, meaning that $y_3$ Granger-causes $y_1$ in a bivariate
system. The reason is that lagged $y_3$ is correlated with the omitted variable $y_2$,
which is also helpful in forecasting $y_1$.*


11.3. Maximum Likelihood Estimation of Restricted 
Vector Autoregressions 


Section 11.1 discussed maximum likelihood estimation and hypothesis testing on 
unrestricted vector autoregressions. In these systems each equation in the VAR 
had the same explanatory variables, namely, a constant term and lags of all the 
variables in the system. We showed how to calculate a Wald test of linear constraints 
but did not discuss estimation of the system subject to the constraints. This section 
examines estimation of a restricted VAR. 


Granger Causality in a Multivariate Context 


As an example of a restricted system that we might be interested in estimating,
consider a vector generalization of the issues explored in the previous section.
Suppose that the variables of a VAR are categorized into two groups, as represented
by the ($n_1 \times 1$) vector $y_{1t}$ and the ($n_2 \times 1$) vector $y_{2t}$. The VAR may then be
written

$$y_{1t} = c_1 + A_1'x_{1t} + A_2'x_{2t} + \varepsilon_{1t}$$ [11.3.1]
$$y_{2t} = c_2 + B_1'x_{1t} + B_2'x_{2t} + \varepsilon_{2t}.$$ [11.3.2]

Here $x_{1t}$ is an ($n_1 p \times 1$) vector containing lags of $y_{1t}$, and the ($n_2 p \times 1$) vector
$x_{2t}$ contains lags of $y_{2t}$:

$$x_{1t} = \begin{bmatrix} y_{1,t-1} \\ y_{1,t-2} \\ \vdots \\ y_{1,t-p} \end{bmatrix} \qquad x_{2t} = \begin{bmatrix} y_{2,t-1} \\ y_{2,t-2} \\ \vdots \\ y_{2,t-p} \end{bmatrix}.$$

The ($n_1 \times 1$) and ($n_2 \times 1$) vectors $c_1$ and $c_2$ contain the constant terms of the
VAR, while the matrices $A_1$, $A_2$, $B_1$, and $B_2$ contain the autoregressive coefficients.

The group of variables represented by $y_1$ is said to be block-exogenous in the
time series sense with respect to the variables in $y_2$ if the elements in $y_2$ are of no
help in improving a forecast of any variable contained in $y_1$ that is based on lagged
values of all the elements of $y_1$ alone. In the system of [11.3.1] and [11.3.2], $y_1$ is
block-exogenous when $A_2 = 0$. To discuss estimation of the system subject to this
constraint, we first note an alternative form in which the unrestricted likelihood
can be calculated and maximized.


*The reader may note that for this example the correlation between $y_{1t}$ and $y_{3,t-1}$ is zero. However,
there are nonzero correlations between (1) $y_{1t}$ and $y_{1,t-1}$ and (2) $y_{1,t-1}$ and $y_{3,t-1}$, and these account
for the contribution of $y_{3,t-1}$ to a forecast of $y_{1t}$ that already includes $y_{1,t-1}$.




An Alternative Expression for the Likelihood Function 


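Section 11.1 calculated the log likelihood function for a VAR using the
prediction-error decomposition

$$\mathcal{L}(\theta) = \sum_{t=1}^{T} \log f_{Y_t|X_t}(y_t | x_t; \theta),$$ [11.3.3]

where $y_t' = (y_{1t}', y_{2t}')$, $x_t' = (y_{t-1}', y_{t-2}', \ldots, y_{t-p}')$, and

$$\begin{aligned} \log f_{Y_t|X_t}(y_t | x_t; \theta) = {} & -\frac{n_1 + n_2}{2}\log(2\pi) - \frac{1}{2}\log\begin{vmatrix} \Omega_{11} & \Omega_{12} \\ \Omega_{21} & \Omega_{22} \end{vmatrix} \\ & - \frac{1}{2}\begin{bmatrix} (y_{1t} - c_1 - A_1'x_{1t} - A_2'x_{2t})' & (y_{2t} - c_2 - B_1'x_{1t} - B_2'x_{2t})' \end{bmatrix} \\ & \times \begin{bmatrix} \Omega_{11} & \Omega_{12} \\ \Omega_{21} & \Omega_{22} \end{bmatrix}^{-1} \begin{bmatrix} y_{1t} - c_1 - A_1'x_{1t} - A_2'x_{2t} \\ y_{2t} - c_2 - B_1'x_{1t} - B_2'x_{2t} \end{bmatrix}. \end{aligned}$$ [11.3.4]

Alternatively, the joint density in [11.3.4] could be written as the product of a
marginal density of $y_{1t}$ with the conditional density of $y_{2t}$ given $y_{1t}$:

$$f_{Y_t|X_t}(y_t | x_t; \theta) = f_{Y_{1t}|X_t}(y_{1t} | x_t; \theta) \cdot f_{Y_{2t}|Y_{1t},X_t}(y_{2t} | y_{1t}, x_t; \theta).$$ [11.3.5]

Conditional on $x_t$, the density of $y_{1t}$ is

$$\begin{aligned} f_{Y_{1t}|X_t}(y_{1t} | x_t; \theta) = {} & (2\pi)^{-n_1/2}|\Omega_{11}|^{-1/2} \\ & \times \exp[-\tfrac{1}{2}(y_{1t} - c_1 - A_1'x_{1t} - A_2'x_{2t})'\Omega_{11}^{-1}(y_{1t} - c_1 - A_1'x_{1t} - A_2'x_{2t})], \end{aligned}$$ [11.3.6]

while the conditional density of $y_{2t}$ given $y_{1t}$ and $x_t$ is also Gaussian:

$$f_{Y_{2t}|Y_{1t},X_t}(y_{2t} | y_{1t}, x_t; \theta) = (2\pi)^{-n_2/2}|H|^{-1/2}\exp[-\tfrac{1}{2}(y_{2t} - m_{2t})'H^{-1}(y_{2t} - m_{2t})].$$ [11.3.7]

The parameters of this conditional distribution can be calculated using the results
from Section 4.6. The conditional variance is given by equation [4.6.6]:

$$H = \Omega_{22} - \Omega_{21}\Omega_{11}^{-1}\Omega_{12},$$

while the conditional mean ($m_{2t}$) can be calculated from [4.6.5]:

$$m_{2t} = E(y_{2t} | x_t) + \Omega_{21}\Omega_{11}^{-1}[y_{1t} - E(y_{1t} | x_t)].$$ [11.3.8]

Notice from [11.3.1] that

$$E(y_{1t} | x_t) = c_1 + A_1'x_{1t} + A_2'x_{2t},$$

while from [11.3.2],

$$E(y_{2t} | x_t) = c_2 + B_1'x_{1t} + B_2'x_{2t}.$$

Substituting these expressions into [11.3.8],

$$\begin{aligned} m_{2t} &= (c_2 + B_1'x_{1t} + B_2'x_{2t}) + \Omega_{21}\Omega_{11}^{-1}[y_{1t} - (c_1 + A_1'x_{1t} + A_2'x_{2t})] \\ &= d + D_0'y_{1t} + D_1'x_{1t} + D_2'x_{2t}, \end{aligned}$$

where

$$d = c_2 - \Omega_{21}\Omega_{11}^{-1}c_1$$ [11.3.9]
$$D_0' = \Omega_{21}\Omega_{11}^{-1}$$ [11.3.10]
$$D_1' = B_1' - \Omega_{21}\Omega_{11}^{-1}A_1'$$ [11.3.11]
$$D_2' = B_2' - \Omega_{21}\Omega_{11}^{-1}A_2'.$$ [11.3.12]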



The log of the joint density in [11.3.4] can thus equivalently be calculated as
the sum of the logs of the marginal density [11.3.6] and the conditional density
[11.3.7]:

$$\log f_{Y_t|X_t}(y_t | x_t; \theta) = \ell_{1t} + \ell_{2t},$$ [11.3.13]

where

$$\begin{aligned} \ell_{1t} = {} & (-n_1/2)\log(2\pi) - \tfrac{1}{2}\log|\Omega_{11}| \\ & - \tfrac{1}{2}[(y_{1t} - c_1 - A_1'x_{1t} - A_2'x_{2t})'\Omega_{11}^{-1}(y_{1t} - c_1 - A_1'x_{1t} - A_2'x_{2t})] \end{aligned}$$ [11.3.14]

$$\begin{aligned} \ell_{2t} = {} & (-n_2/2)\log(2\pi) - \tfrac{1}{2}\log|H| \\ & - \tfrac{1}{2}[(y_{2t} - d - D_0'y_{1t} - D_1'x_{1t} - D_2'x_{2t})'H^{-1}(y_{2t} - d - D_0'y_{1t} - D_1'x_{1t} - D_2'x_{2t})]. \end{aligned}$$ [11.3.15]

The sample log likelihood would then be expressed as

$$\mathcal{L}(\theta) = \sum_{t=1}^{T}\ell_{1t} + \sum_{t=1}^{T}\ell_{2t}.$$ [11.3.16]

Equations [11.3.4] and [11.3.13] are two different expressions for the same
magnitude. As long as the parameters in the second representation are related to
those of the first as in [11.3.9] through [11.3.12], either calculation would produce
the identical value for the likelihood. If [11.3.3] is maximized by choice of ($c_1$, $A_1$,
$A_2$, $c_2$, $B_1$, $B_2$, $\Omega_{11}$, $\Omega_{21}$, $\Omega_{22}$), the same value for the likelihood will be achieved
as by maximizing [11.3.16] by choice of ($c_1$, $A_1$, $A_2$, $d$, $D_0$, $D_1$, $D_2$, $\Omega_{11}$, $H$).

The second maximization is as easy to achieve as the first. Since the parameters
($c_1$, $A_1$, $A_2$) appear in [11.3.16] only through $\sum_{t=1}^{T}\ell_{1t}$, the MLEs of these parameters
can be found by OLS regressions of the elements of $y_{1t}$ on a constant and lagged
values of $y_1$ and $y_2$, that is, by OLS estimation of

$$y_{1t} = c_1 + A_1'x_{1t} + A_2'x_{2t} + \varepsilon_{1t}.$$ [11.3.17]

The MLE of $\Omega_{11}$ is the sample variance-covariance matrix of the residuals from
these regressions, $\hat{\Omega}_{11} = (1/T)\sum_{t=1}^{T}\hat{\varepsilon}_{1t}\hat{\varepsilon}_{1t}'$. Similarly, the parameters ($d$, $D_0$, $D_1$,
$D_2$) appear in [11.3.16] only through $\sum_{t=1}^{T}\ell_{2t}$, and so their MLEs are obtained from
OLS regressions of the elements of $y_{2t}$ on a constant, current and lagged values of
$y_1$, and lagged values of $y_2$:

$$y_{2t} = d + D_0'y_{1t} + D_1'x_{1t} + D_2'x_{2t} + v_{2t}.$$ [11.3.18]

The MLE of $H$ is the sample variance-covariance matrix of the residuals from this
second set of regressions, $\hat{H} = (1/T)\sum_{t=1}^{T}\hat{v}_{2t}\hat{v}_{2t}'$.

Note that the population residuals associated with the second set of regres-
sions, $v_{2t}$, are uncorrelated with the population residuals of the first regressions.
This is because $v_{2t} = y_{2t} - E(y_{2t} | y_{1t}, x_t)$ is uncorrelated by construction with $y_{1t}$
and $x_t$, whereas $\varepsilon_{1t}$ is a linear function of $y_{1t}$ and $x_t$. Similarly, the OLS sample
residuals associated with the second regressions,

$$\hat{v}_{2t} = y_{2t} - \hat{d} - \hat{D}_0'y_{1t} - \hat{D}_1'x_{1t} - \hat{D}_2'x_{2t},$$

are orthogonal by construction to $y_{1t}$, a constant term, and $x_t$. Since the OLS sample
residuals associated with the first regressions, $\hat{\varepsilon}_{1t}$, are linear functions of these same
elements, $\hat{v}_{2t}$ is orthogonal by construction to $\hat{\varepsilon}_{1t}$.
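The equivalence of the joint form [11.3.4] and the marginal-plus-conditional form [11.3.13] can be confirmed numerically. The following is a minimal sketch under stated assumptions: the block sizes, the randomly drawn positive definite matrix, and the residual vector are arbitrary choices made here for illustration, and `gauss_logpdf` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2 = 2, 3                      # illustrative block sizes
n = n1 + n2

# An arbitrary positive definite Omega and an arbitrary residual vector.
M = rng.standard_normal((n, n))
omega = M @ M.T + n * np.eye(n)
eps = rng.standard_normal(n)       # stands in for (eps_1t', eps_2t')'
e1, e2 = eps[:n1], eps[n1:]

o11, o12 = omega[:n1, :n1], omega[:n1, n1:]
o21, o22 = omega[n1:, :n1], omega[n1:, n1:]

def gauss_logpdf(resid, cov):
    """Log of a zero-mean Gaussian density evaluated at resid."""
    m = resid.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    quad = resid @ np.linalg.solve(cov, resid)
    return -0.5 * m * np.log(2 * np.pi) - 0.5 * logdet - 0.5 * quad

joint = gauss_logpdf(eps, omega)           # joint form, as in [11.3.4]

d0_t = o21 @ np.linalg.inv(o11)            # D_0' from [11.3.10]
H = o22 - d0_t @ o12                       # conditional variance H
v2 = e2 - d0_t @ e1                        # y_2t - m_2t in terms of residuals
ell_1 = gauss_logpdf(e1, o11)              # marginal term, [11.3.14]
ell_2 = gauss_logpdf(v2, H)                # conditional term, [11.3.15]

print(joint, ell_1 + ell_2)                # the two values should agree
```

The check relies on $y_{2t} - m_{2t} = \varepsilon_{2t} - \Omega_{21}\Omega_{11}^{-1}\varepsilon_{1t}$, which follows from [11.3.8].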


Maximum Likelihood Estimation of a VAR Characterized
by Block Exogeneity

Now consider maximum likelihood estimation of the system subject to the
constraint that $A_2 = 0$. Suppose we view ($d$, $D_0$, $D_1$, $D_2$, $H$) rather than ($c_2$, $B_1$,
$B_2$, $\Omega_{21}$, $\Omega_{22}$) as the parameters of interest for the second equation and take our
objective to be to choose values for ($c_1$, $A_1$, $\Omega_{11}$, $d$, $D_0$, $D_1$, $D_2$, $H$) so as to maximize
the likelihood function. For this parameterization, the value of $A_2$ does not affect
the value of $\ell_{2t}$ in [11.3.15]. Thus, the full-information maximum likelihood esti-
mates of $c_1$, $A_1$, and $\Omega_{11}$ can be based solely on a restricted version of the regressions
in [11.3.17],

$$y_{1t} = c_1 + A_1'x_{1t} + \varepsilon_{1t}.$$ [11.3.19]

Let $\hat{c}_1(0)$, $\hat{A}_1(0)$, $\hat{\Omega}_{11}(0)$ denote the estimates from these restricted regressions. The
maximum likelihood estimates of the other parameters of the system ($d$, $D_0$, $D_1$,
$D_2$, $H$) continue to be given by unrestricted OLS estimation of [11.3.18], with
estimates denoted ($\hat{d}$, $\hat{D}_0$, $\hat{D}_1$, $\hat{D}_2$, $\hat{H}$).
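The maximum value achieved for the log likelihood function can be found
by applying [11.1.32] to [11.3.13]:

$$\begin{aligned} \mathcal{L}[\hat{\theta}(0)] = {} & \sum_{t=1}^{T}\ell_{1t}[\hat{c}_1(0), \hat{A}_1(0), \hat{\Omega}_{11}(0)] + \sum_{t=1}^{T}\ell_{2t}[\hat{d}, \hat{D}_0, \hat{D}_1, \hat{D}_2, \hat{H}] \\ = {} & [-(Tn_1/2)\log(2\pi) + (T/2)\log|\hat{\Omega}_{11}^{-1}(0)| - (Tn_1/2)] \\ & + [-(Tn_2/2)\log(2\pi) + (T/2)\log|\hat{H}^{-1}| - (Tn_2/2)]. \end{aligned}$$ [11.3.20]

By contrast, when the system is estimated with no constraints on $A_2$, the value
achieved for the log likelihood is

$$\begin{aligned} \mathcal{L}[\hat{\theta}] = {} & \sum_{t=1}^{T}\ell_{1t}[\hat{c}_1, \hat{A}_1, \hat{A}_2, \hat{\Omega}_{11}] + \sum_{t=1}^{T}\ell_{2t}[\hat{d}, \hat{D}_0, \hat{D}_1, \hat{D}_2, \hat{H}] \\ = {} & [-(Tn_1/2)\log(2\pi) + (T/2)\log|\hat{\Omega}_{11}^{-1}| - (Tn_1/2)] \\ & + [-(Tn_2/2)\log(2\pi) + (T/2)\log|\hat{H}^{-1}| - (Tn_2/2)], \end{aligned}$$ [11.3.21]

where ($\hat{c}_1$, $\hat{A}_1$, $\hat{A}_2$, $\hat{\Omega}_{11}$) denote estimates based on OLS estimation of [11.3.17]. A
likelihood ratio test of the null hypothesis that $A_2 = 0$ can thus be based on

$$\begin{aligned} 2\{\mathcal{L}[\hat{\theta}] - \mathcal{L}[\hat{\theta}(0)]\} &= T\{\log|\hat{\Omega}_{11}^{-1}| - \log|\hat{\Omega}_{11}^{-1}(0)|\} \\ &= T\{\log|\hat{\Omega}_{11}(0)| - \log|\hat{\Omega}_{11}|\}. \end{aligned}$$ [11.3.22]

This will have an asymptotic $\chi^2$ distribution with degrees of freedom equal to the
number of restrictions. Since $A_2'$ is an ($n_1 \times n_2 p$) matrix, the number of restrictions
is $n_1 n_2 p$.

Thus, to test the null hypothesis that the $n_1$ variables represented by $y_1$ are
block-exogenous with respect to the $n_2$ variables represented by $y_2$, perform OLS
regressions of each of the elements of $y_1$ on a constant, $p$ lags of all of the elements
of $y_1$, and $p$ lags of all of the elements of $y_2$. Let $\hat{\varepsilon}_{1t}$ denote the ($n_1 \times 1$) vector of
sample residuals for date $t$ from these regressions and $\hat{\Omega}_{11}$ their variance-covariance
matrix ($\hat{\Omega}_{11} = (1/T)\sum_{t=1}^{T}\hat{\varepsilon}_{1t}\hat{\varepsilon}_{1t}'$). Next perform OLS regressions of each of the
elements of $y_1$ on a constant and $p$ lags of all the elements of $y_1$. Let $\hat{\varepsilon}_{1t}(0)$ denote
the ($n_1 \times 1$) vector of sample residuals from this second set of regressions and
$\hat{\Omega}_{11}(0)$ their variance-covariance matrix ($\hat{\Omega}_{11}(0) = (1/T)\sum_{t=1}^{T}[\hat{\varepsilon}_{1t}(0)][\hat{\varepsilon}_{1t}(0)]'$).
If

$$T\{\log|\hat{\Omega}_{11}(0)| - \log|\hat{\Omega}_{11}|\}$$

is greater than the 5% critical value for a $\chi^2(n_1 n_2 p)$ variable, then the null hypothesis
is rejected, and the conclusion is that some of the elements of $y_2$ are helpful in
forecasting $y_1$.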
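The procedure just described amounts to two sets of OLS regressions and a comparison of log determinants. The following is a minimal sketch, assuming the data arrive as NumPy arrays with one row per date; the function names (`lagmat`, `resid_cov`, `block_exogeneity_lr`) and the simulated data are hypothetical and serve only to illustrate the calculation.

```python
import numpy as np

def lagmat(Y, p):
    """Row for date t holds (y_{t-1}', ..., y_{t-p}'); the first p dates are dropped."""
    T = Y.shape[0]
    return np.column_stack([Y[p - k:T - k] for k in range(1, p + 1)])

def resid_cov(Y, X):
    """Equation-by-equation OLS of Y on X; returns (1/T) * sum of e_t e_t'."""
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)
    E = Y - X @ B
    return E.T @ E / Y.shape[0]

def block_exogeneity_lr(Y1, Y2, p):
    """LR statistic T{log|Omega11(0)| - log|Omega11|} and its degrees of freedom."""
    T_eff = Y1.shape[0] - p
    const = np.ones((T_eff, 1))
    X1, X2 = lagmat(Y1, p), lagmat(Y2, p)
    y = Y1[p:]
    omega_u = resid_cov(y, np.hstack([const, X1, X2]))   # unrestricted
    omega_r = resid_cov(y, np.hstack([const, X1]))       # lags of y2 omitted
    stat = T_eff * (np.linalg.slogdet(omega_r)[1] - np.linalg.slogdet(omega_u)[1])
    dof = Y1.shape[1] * Y2.shape[1] * p
    return stat, dof

# Illustrative data in which lagged y2 does help forecast y1.
rng = np.random.default_rng(0)
T = 500
e = rng.standard_normal((T, 2))
y1, y2 = np.zeros(T), np.zeros(T)
for t in range(1, T):
    y2[t] = 0.5 * y2[t - 1] + e[t, 1]
    y1[t] = 0.3 * y1[t - 1] + 0.8 * y2[t - 1] + e[t, 0]

stat, dof = block_exogeneity_lr(y1[:, None], y2[:, None], p=2)
stat_rev, dof_rev = block_exogeneity_lr(y2[:, None], y1[:, None], p=2)
print(f"y1 block-exogenous w.r.t. y2? LR = {stat:.1f} on {dof} d.f.")
print(f"y2 block-exogenous w.r.t. y1? LR = {stat_rev:.2f} on {dof_rev} d.f.")
```

The statistic would then be compared with a $\chi^2$ critical value for the reported degrees of freedom. Note that the sketch uses the number of usable observations after dropping the first $p$ dates as $T$, which is one of several possible conventions.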

Thus, if our interest is in estimation of the parameters ($c_1$, $A_1$, $\Omega_{11}$, $d$, $D_0$,
$D_1$, $D_2$, $H$) or testing a hypothesis about block exogeneity, all that is necessary
is OLS regression on the affected equations. Suppose, however, that we wanted
full-information maximum likelihood estimates of the parameters of the likelihood
as originally parameterized ($c_1$, $A_1$, $\Omega_{11}$, $c_2$, $B_1$, $B_2$, $\Omega_{21}$, $\Omega_{22}$). For the parameters
of the first block of equations ($c_1$, $A_1$, $\Omega_{11}$), the MLEs continue to be given by
OLS estimation of [11.3.19]. The parameters of the second block can be found
from the OLS estimates by inverting equations [11.3.9] through [11.3.12]:*

$$\hat{\Omega}_{21}(0) = \hat{D}_0'[\hat{\Omega}_{11}(0)]$$
$$\hat{c}_2(0) = \hat{d} + [\hat{\Omega}_{21}(0)][\hat{\Omega}_{11}(0)]^{-1}[\hat{c}_1(0)]$$
$$[\hat{B}_1(0)]' = \hat{D}_1' + [\hat{\Omega}_{21}(0)][\hat{\Omega}_{11}(0)]^{-1}[\hat{A}_1(0)]'$$
$$[\hat{B}_2(0)]' = \hat{D}_2'$$
$$\hat{\Omega}_{22}(0) = \hat{H} + [\hat{\Omega}_{21}(0)][\hat{\Omega}_{11}(0)]^{-1}[\hat{\Omega}_{12}(0)].$$

Thus, the maximum likelihood estimates for the original parameterization of [11.3.2]
are found from these equations by combining the OLS estimates from [11.3.19]
and [11.3.18].


Geweke’s Measure of Linear Dependence 


The previous subsection modeled the relation between an ($n_1 \times 1$) vector $y_{1t}$
and an ($n_2 \times 1$) vector $y_{2t}$ in terms of the pth-order VAR [11.3.1] and [11.3.2],
where the innovations have a variance-covariance matrix given by

$$E\begin{bmatrix} \varepsilon_{1t}\varepsilon_{1t}' & \varepsilon_{1t}\varepsilon_{2t}' \\ \varepsilon_{2t}\varepsilon_{1t}' & \varepsilon_{2t}\varepsilon_{2t}' \end{bmatrix} = \begin{bmatrix} \Omega_{11} & \Omega_{12} \\ \Omega_{21} & \Omega_{22} \end{bmatrix}.$$

To test the null hypothesis that $y_1$ is block-exogenous with respect to $y_2$, we proposed
calculating the statistic in [11.3.22],

$$T\{\log|\hat{\Omega}_{11}(0)| - \log|\hat{\Omega}_{11}|\} \approx \chi^2(n_1 n_2 p),$$ [11.3.23]

where $\hat{\Omega}_{11}$ is the variance-covariance matrix of the residuals from OLS estimation
of [11.3.1] and $\hat{\Omega}_{11}(0)$ is the variance-covariance matrix of the residuals from OLS
estimation of [11.3.1] when lagged values of $y_2$ are omitted from the regression
(that is, when $A_2 = 0$ in [11.3.1]).

Clearly, to test the parallel null hypothesis that $y_2$ is block-exogenous with
respect to $y_1$, we would calculate

$$T\{\log|\hat{\Omega}_{22}(0)| - \log|\hat{\Omega}_{22}|\} \approx \chi^2(n_2 n_1 p),$$ [11.3.24]

where $\hat{\Omega}_{22}$ is the variance-covariance matrix of the residuals from OLS estimation
of [11.3.2] and $\hat{\Omega}_{22}(0)$ is the variance-covariance matrix of the residuals from OLS
estimation of [11.3.2] when lagged values of $y_1$ are omitted from the regression
(that is, when $B_1 = 0$ in [11.3.2]).

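*To confirm that the resulting estimate $\hat{\Omega}(0)$ is symmetric and positive definite, notice that

$$\hat{\Omega}_{22}(0) = \hat{H} + \hat{D}_0'[\hat{\Omega}_{11}(0)]\hat{D}_0,$$

and so

$$\begin{bmatrix} \hat{\Omega}_{11}(0) & \hat{\Omega}_{12}(0) \\ \hat{\Omega}_{21}(0) & \hat{\Omega}_{22}(0) \end{bmatrix} = \begin{bmatrix} I_{n_1} & 0 \\ \hat{D}_0' & I_{n_2} \end{bmatrix} \begin{bmatrix} \hat{\Omega}_{11}(0) & 0 \\ 0 & \hat{H} \end{bmatrix} \begin{bmatrix} I_{n_1} & \hat{D}_0 \\ 0 & I_{n_2} \end{bmatrix}.$$


Finally, consider maximum likelihood estimation of the VAR subject to the
restriction that there is no relation whatsoever between $y_1$ and $y_2$, that is, subject
to the restrictions that $A_2 = 0$, $B_1 = 0$, and $\Omega_{21} = 0$. For this most restricted
specification, the log likelihood becomes

$$\begin{aligned} \mathcal{L}(\theta) = {} & \sum_{t=1}^{T}\Big\{-(n_1/2)\log(2\pi) - (1/2)\log|\Omega_{11}| \\ & \qquad - (1/2)(y_{1t} - c_1 - A_1'x_{1t})'\Omega_{11}^{-1}(y_{1t} - c_1 - A_1'x_{1t})\Big\} \\ & + \sum_{t=1}^{T}\Big\{-(n_2/2)\log(2\pi) - (1/2)\log|\Omega_{22}| \\ & \qquad - (1/2)(y_{2t} - c_2 - B_2'x_{2t})'\Omega_{22}^{-1}(y_{2t} - c_2 - B_2'x_{2t})\Big\} \end{aligned}$$

and the maximized value is

$$\begin{aligned} \mathcal{L}(\hat{\theta}(0)) = {} & \{-(Tn_1/2)\log(2\pi) - (T/2)\log|\hat{\Omega}_{11}(0)| - (Tn_1/2)\} \\ & + \{-(Tn_2/2)\log(2\pi) - (T/2)\log|\hat{\Omega}_{22}(0)| - (Tn_2/2)\}. \end{aligned}$$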


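A likelihood ratio test of the null hypothesis of no relation at all between $y_1$ and
$y_2$ is thus given by

$$2\{\mathcal{L}(\hat{\theta}) - \mathcal{L}(\hat{\theta}(0))\} = T\left\{\log|\hat{\Omega}_{11}(0)| + \log|\hat{\Omega}_{22}(0)| - \log\begin{vmatrix} \hat{\Omega}_{11} & \hat{\Omega}_{12} \\ \hat{\Omega}_{21} & \hat{\Omega}_{22} \end{vmatrix}\right\},$$ [11.3.25]

where $\hat{\Omega}_{12}$ is the covariance matrix between the residuals from unrestricted OLS
estimation of [11.3.1] and [11.3.2]. This null hypothesis imposed the ($n_1 n_2 p$) re-
strictions that $A_2 = 0$, the ($n_2 n_1 p$) restrictions that $B_1 = 0$, and the ($n_2 n_1$) restric-
tions that $\Omega_{21} = 0$. Hence, the statistic in [11.3.25] has a $\chi^2$ distribution with
$(n_1 n_2) \times (2p + 1)$ degrees of freedom.

Geweke (1982) proposed (1/T) times the magnitude in [11.3.25] as a measure
of the degree of linear dependence between $y_1$ and $y_2$. Note that [11.3.25] can be
expressed as the sum of three terms:

$$\begin{aligned} & T\left\{\log|\hat{\Omega}_{11}(0)| + \log|\hat{\Omega}_{22}(0)| - \log\begin{vmatrix} \hat{\Omega}_{11} & \hat{\Omega}_{12} \\ \hat{\Omega}_{21} & \hat{\Omega}_{22} \end{vmatrix}\right\} \\ & \quad = T\{\log|\hat{\Omega}_{11}(0)| - \log|\hat{\Omega}_{11}|\} + T\{\log|\hat{\Omega}_{22}(0)| - \log|\hat{\Omega}_{22}|\} \\ & \qquad + T\left\{\log|\hat{\Omega}_{11}| + \log|\hat{\Omega}_{22}| - \log\begin{vmatrix} \hat{\Omega}_{11} & \hat{\Omega}_{12} \\ \hat{\Omega}_{21} & \hat{\Omega}_{22} \end{vmatrix}\right\}. \end{aligned}$$ [11.3.26]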

The first of these three terms, $T\{\log|\hat{\Omega}_{11}(0)| - \log|\hat{\Omega}_{11}|\}$, is a measure of the
strength of the linear feedback from $y_2$ to $y_1$ and is the $\chi^2(n_1 n_2 p)$ statistic calculated
in [11.3.23]. The second term, $T\{\log|\hat{\Omega}_{22}(0)| - \log|\hat{\Omega}_{22}|\}$, is an analogous measure
of the strength of linear feedback from $y_1$ to $y_2$ and is the $\chi^2(n_2 n_1 p)$ statistic in
[11.3.24]. The third term,

$$T\left\{\log|\hat{\Omega}_{11}| + \log|\hat{\Omega}_{22}| - \log\begin{vmatrix} \hat{\Omega}_{11} & \hat{\Omega}_{12} \\ \hat{\Omega}_{21} & \hat{\Omega}_{22} \end{vmatrix}\right\},$$

is a measure of instantaneous feedback. This corresponds to a likelihood ratio test
of the null hypothesis that $\Omega_{21} = 0$ with $A_2$ and $B_1$ unrestricted and has a $\chi^2(n_1 n_2)$
distribution under the null.

Thus, [11.3.26] can be used to summarize the strength of any linear relation
between $y_1$ and $y_2$ and identify the source of that relation. Geweke showed how
these measures can be further decomposed by frequency.
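The three-way decomposition in [11.3.26] can be computed from three sets of OLS residuals. The sketch below is one possible arrangement, not a definitive implementation: the helper names (`lags`, `rcov`, `logdet`, `geweke_terms`) and the simulated series are assumptions introduced here, and $T$ is taken to be the number of usable observations after dropping the first $p$ dates.

```python
import numpy as np

def lags(Y, p):
    """Row for date t holds (y_{t-1}', ..., y_{t-p}')."""
    T = Y.shape[0]
    return np.column_stack([Y[p - k:T - k] for k in range(1, p + 1)])

def rcov(Y, X):
    """Residual variance-covariance matrix from OLS of Y on X."""
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)
    E = Y - X @ B
    return E.T @ E / Y.shape[0]

def logdet(S):
    return np.linalg.slogdet(S)[1]

def geweke_terms(Y1, Y2, p):
    """Return the three terms on the right of [11.3.26] and their total."""
    T = Y1.shape[0] - p
    n1 = Y1.shape[1]
    c = np.ones((T, 1))
    X1, X2 = lags(Y1, p), lags(Y2, p)
    Y = np.hstack([Y1[p:], Y2[p:]])
    omega = rcov(Y, np.hstack([c, X1, X2]))        # unrestricted VAR
    o11, o22 = omega[:n1, :n1], omega[n1:, n1:]
    o11_0 = rcov(Y1[p:], np.hstack([c, X1]))       # lags of y2 omitted
    o22_0 = rcov(Y2[p:], np.hstack([c, X2]))       # lags of y1 omitted
    f_2_to_1 = T * (logdet(o11_0) - logdet(o11))
    f_1_to_2 = T * (logdet(o22_0) - logdet(o22))
    f_inst = T * (logdet(o11) + logdet(o22) - logdet(omega))
    total = T * (logdet(o11_0) + logdet(o22_0) - logdet(omega))
    return f_2_to_1, f_1_to_2, f_inst, total

# Illustrative data: y2 feeds into y1 with a one-period lag.
rng = np.random.default_rng(0)
T = 400
e = rng.standard_normal((T, 2))
y1, y2 = np.zeros(T), np.zeros(T)
for t in range(1, T):
    y2[t] = 0.5 * y2[t - 1] + e[t, 1]
    y1[t] = 0.3 * y1[t - 1] + 0.8 * y2[t - 1] + e[t, 0]

f21, f12, f_inst, total = geweke_terms(y1[:, None], y2[:, None], p=2)
print(f21, f12, f_inst, total)
```

Dividing each returned value by $T$ gives the corresponding measure of linear dependence on Geweke's scale.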




Maximum Likelihood Estimation Under General 
Coefficient Constraints 


We now discuss maximum likelihood estimation of a vector autoregression 
in which there are constraints that cannot be expressed in a block-recursive form 
as in the previous example. A VAR subject to general exclusion restrictions can 
be viewed as a system of “seemingly unrelated regressions” as originally analyzed 
by Zellner (1962). 

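Let $x_{1t}$ be a ($k_1 \times 1$) vector containing a constant term and lags of the variables
that appear in the first equation of the VAR:

$$y_{1t} = x_{1t}'\beta_1 + \varepsilon_{1t}.$$

Similarly, let $x_{2t}$ denote a ($k_2 \times 1$) vector containing the explanatory variables for
the second equation and $x_{nt}$ a ($k_n \times 1$) vector containing the variables for the last
equation. Hence, the VAR consists of the system of equations

$$\begin{aligned} y_{1t} &= x_{1t}'\beta_1 + \varepsilon_{1t} \\ y_{2t} &= x_{2t}'\beta_2 + \varepsilon_{2t} \\ &\;\;\vdots \\ y_{nt} &= x_{nt}'\beta_n + \varepsilon_{nt}. \end{aligned}$$ [11.3.27]

Let $k = k_1 + k_2 + \cdots + k_n$ denote the total number of coefficients to be
estimated, and collect these in a ($k \times 1$) vector:

$$\beta = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_n \end{bmatrix}.$$

Then the system of equations in [11.3.27] can be written in vector form as

$$y_t = \mathcal{X}_t'\beta + \varepsilon_t,$$ [11.3.28]

where $\mathcal{X}_t'$ is the following ($n \times k$) matrix:

$$\mathcal{X}_t' = \begin{bmatrix} \bar{x}_{1t}' \\ \bar{x}_{2t}' \\ \vdots \\ \bar{x}_{nt}' \end{bmatrix} = \begin{bmatrix} x_{1t}' & 0' & \cdots & 0' \\ 0' & x_{2t}' & \cdots & 0' \\ \vdots & \vdots & & \vdots \\ 0' & 0' & \cdots & x_{nt}' \end{bmatrix}.$$

Thus, $\bar{x}_{it}'$ is defined as a ($1 \times k$) vector containing the $k_i$ explanatory variables for
equation $i$, with zeros added so as to be conformable with the ($k \times 1$) vector $\beta$.
The goal is to choose $\beta$ and $\Omega$ so as to maximize the log likelihood function

$$\mathcal{L}(\beta, \Omega) = -(Tn/2)\log(2\pi) + (T/2)\log|\Omega^{-1}| - (1/2)\sum_{t=1}^{T}(y_t - \mathcal{X}_t'\beta)'\Omega^{-1}(y_t - \mathcal{X}_t'\beta).$$ [11.3.29]

This calls for choosing $\beta$ so as to minimize

$$\sum_{t=1}^{T}(y_t - \mathcal{X}_t'\beta)'\Omega^{-1}(y_t - \mathcal{X}_t'\beta).$$ [11.3.30]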




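If $\Omega^{-1}$ is written as $L'L$, this becomes

$$\begin{aligned} \sum_{t=1}^{T}(y_t - \mathcal{X}_t'\beta)'\Omega^{-1}(y_t - \mathcal{X}_t'\beta) &= \sum_{t=1}^{T}(Ly_t - L\mathcal{X}_t'\beta)'(Ly_t - L\mathcal{X}_t'\beta) \\ &= \sum_{t=1}^{T}(\tilde{y}_t - \tilde{\mathcal{X}}_t'\beta)'(\tilde{y}_t - \tilde{\mathcal{X}}_t'\beta), \end{aligned}$$ [11.3.31]

where $\tilde{y}_t = Ly_t$ and

$$\tilde{\mathcal{X}}_t' = L\mathcal{X}_t' = \begin{bmatrix} \tilde{x}_{1t}' \\ \tilde{x}_{2t}' \\ \vdots \\ \tilde{x}_{nt}' \end{bmatrix}.$$

But [11.3.31] is simply

$$\begin{aligned} \sum_{t=1}^{T}(\tilde{y}_t - \tilde{\mathcal{X}}_t'\beta)'(\tilde{y}_t - \tilde{\mathcal{X}}_t'\beta) &= \sum_{t=1}^{T}\begin{bmatrix} \tilde{y}_{1t} - \tilde{x}_{1t}'\beta \\ \tilde{y}_{2t} - \tilde{x}_{2t}'\beta \\ \vdots \\ \tilde{y}_{nt} - \tilde{x}_{nt}'\beta \end{bmatrix}'\begin{bmatrix} \tilde{y}_{1t} - \tilde{x}_{1t}'\beta \\ \tilde{y}_{2t} - \tilde{x}_{2t}'\beta \\ \vdots \\ \tilde{y}_{nt} - \tilde{x}_{nt}'\beta \end{bmatrix} \\ &= \sum_{t=1}^{T}[(\tilde{y}_{1t} - \tilde{x}_{1t}'\beta)^2 + (\tilde{y}_{2t} - \tilde{x}_{2t}'\beta)^2 + \cdots + (\tilde{y}_{nt} - \tilde{x}_{nt}'\beta)^2], \end{aligned}$$

which is minimized by an OLS regression of $\tilde{y}_{it}$ on $\tilde{x}_{it}$, pooling all the equations
($i = 1, 2, \ldots, n$) into one big regression. Thus, the maximum likelihood estimate
is given by

$$\begin{aligned} \hat{\beta} = {} & \left\{\sum_{t=1}^{T}[(\tilde{x}_{1t}\tilde{x}_{1t}') + (\tilde{x}_{2t}\tilde{x}_{2t}') + \cdots + (\tilde{x}_{nt}\tilde{x}_{nt}')]\right\}^{-1} \\ & \times \left\{\sum_{t=1}^{T}[(\tilde{x}_{1t}\tilde{y}_{1t}) + (\tilde{x}_{2t}\tilde{y}_{2t}) + \cdots + (\tilde{x}_{nt}\tilde{y}_{nt})]\right\}. \end{aligned}$$ [11.3.32]

Noting that the variance of the residual of this pooled regression is unity by
construction,* the asymptotic variance-covariance matrix of $\hat{\beta}$ can be calculated
from

$$E(\hat{\beta} - \beta)(\hat{\beta} - \beta)' \cong \left\{\sum_{t=1}^{T}[(\tilde{x}_{1t}\tilde{x}_{1t}') + (\tilde{x}_{2t}\tilde{x}_{2t}') + \cdots + (\tilde{x}_{nt}\tilde{x}_{nt}')]\right\}^{-1}.$$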


*That is,

$$E(\tilde{y}_t - \tilde{\mathcal{X}}_t'\beta)(\tilde{y}_t - \tilde{\mathcal{X}}_t'\beta)' = L\Omega L' = L(L'L)^{-1}L' = I_n.$$


Construction of the variables $\tilde{y}_{it}$ and $\tilde{x}_{it}$ to use in this pooled OLS regression
requires knowledge of $L$ and hence $\Omega$. The parameters in $\beta$ and $\Omega$ can be estimated
jointly by maximum likelihood through the following iterative procedure. From
$n$ OLS regressions of $y_{it}$ on $x_{it}$, form an initial estimate of the coefficient vector
$\hat{\beta}(0) = (b_1' \; b_2' \; \cdots \; b_n')'$. Use this to form an initial estimate of the variance
matrix,

$$\hat{\Omega}(0) = (1/T)\sum_{t=1}^{T}[y_t - \mathcal{X}_t'\hat{\beta}(0)][y_t - \mathcal{X}_t'\hat{\beta}(0)]'.$$

Find a matrix $\hat{L}(0)$ such that $[\hat{L}(0)']\hat{L}(0) = [\hat{\Omega}(0)]^{-1}$, say, by Cholesky factor-
ization, and form $\tilde{y}_t(0) = \hat{L}(0)y_t$ and $\tilde{\mathcal{X}}_t'(0) = \hat{L}(0)\mathcal{X}_t'$. A pooled OLS regression
of $\tilde{y}_{it}(0)$ on $\tilde{x}_{it}(0)$ combining $i = 1, 2, \ldots, n$ then yields the new estimate $\hat{\beta}(1)$,
from which $\hat{\Omega}(1) = (1/T)\sum_{t=1}^{T}[y_t - \mathcal{X}_t'\hat{\beta}(1)][y_t - \mathcal{X}_t'\hat{\beta}(1)]'$. Iterating in this
manner will produce the maximum likelihood estimates ($\hat{\beta}$, $\hat{\Omega}$), though the esti-
mate after just one iteration has the same asymptotic distribution as the final
MLE (see Magnus, 1978).
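The iterative procedure can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than a reference implementation: the function name `sur_iterate`, the stopping rule (`n_iter`, `tol`), and the simulated two-equation system are all choices made here. The matrix $\hat{L}$ is obtained by transposing the lower-triangular Cholesky factor of $\hat{\Omega}^{-1}$, so that $\hat{L}'\hat{L} = \hat{\Omega}^{-1}$.

```python
import numpy as np

def sur_iterate(ys, Xs, n_iter=50, tol=1e-10):
    """Iterated pooled-OLS (feasible GLS) estimation of a system of equations.
    ys: list of n arrays of shape (T,); Xs: list of n arrays of shape (T, k_i)."""
    n, T = len(ys), ys[0].shape[0]
    offsets = np.concatenate([[0], np.cumsum([X.shape[1] for X in Xs])])
    k = int(offsets[-1])
    Y = np.column_stack(ys)                       # rows are y_t'
    # Script-X_t' for every t, stored as a (T, n, k) array with block rows.
    Xcal = np.zeros((T, n, k))
    for i, X in enumerate(Xs):
        Xcal[:, i, offsets[i]:offsets[i + 1]] = X
    # Initial estimate: OLS equation by equation.
    beta = np.concatenate(
        [np.linalg.lstsq(X, y, rcond=None)[0] for X, y in zip(Xs, ys)])
    for _ in range(n_iter):
        resid = Y - Xcal @ beta                   # (T, n)
        omega = resid.T @ resid / T
        # Omega^{-1} = C C' with C lower triangular, so L = C' gives L'L = Omega^{-1}.
        L = np.linalg.cholesky(np.linalg.inv(omega)).T
        Y_tilde = Y @ L.T                         # rows are (L y_t)'
        X_tilde = L @ Xcal                        # L times script-X_t' for each t
        beta_new = np.linalg.lstsq(
            X_tilde.reshape(T * n, k), Y_tilde.reshape(T * n), rcond=None)[0]
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    resid = Y - Xcal @ beta
    return beta, resid.T @ resid / T

# Illustrative system: the second equation excludes the last regressor.
rng = np.random.default_rng(0)
T = 200
X = np.column_stack([np.ones(T), rng.standard_normal((T, 2))])
e = rng.standard_normal((T, 2)) @ np.array([[1.0, 0.0], [0.6, 0.8]])
y_a = X @ np.array([1.0, 0.5, -0.3]) + e[:, 0]
y_b = X[:, :2] @ np.array([0.2, 1.5]) + e[:, 1]

beta_hat, omega_hat = sur_iterate([y_a, y_b], [X, X[:, :2]])
print(beta_hat)
```

When every equation uses the same regressors, the pooled regression should reproduce equation-by-equation OLS, which offers a convenient check of the code.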

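An alternative expression for the MLE in [11.3.32] is sometimes used. Notice
that

$$\begin{aligned} & [(\tilde{x}_{1t}\tilde{x}_{1t}') + (\tilde{x}_{2t}\tilde{x}_{2t}') + \cdots + (\tilde{x}_{nt}\tilde{x}_{nt}')] \\ & \quad = \begin{bmatrix} \tilde{x}_{1t} & \tilde{x}_{2t} & \cdots & \tilde{x}_{nt} \end{bmatrix}\begin{bmatrix} \tilde{x}_{1t}' \\ \tilde{x}_{2t}' \\ \vdots \\ \tilde{x}_{nt}' \end{bmatrix} = \tilde{\mathcal{X}}_t\tilde{\mathcal{X}}_t' = \mathcal{X}_t L'L\mathcal{X}_t' \\ & \quad = \begin{bmatrix} x_{1t} & 0 & \cdots & 0 \\ 0 & x_{2t} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & x_{nt} \end{bmatrix}\begin{bmatrix} \sigma^{11} & \sigma^{12} & \cdots & \sigma^{1n} \\ \sigma^{21} & \sigma^{22} & \cdots & \sigma^{2n} \\ \vdots & \vdots & & \vdots \\ \sigma^{n1} & \sigma^{n2} & \cdots & \sigma^{nn} \end{bmatrix}\begin{bmatrix} x_{1t}' & 0' & \cdots & 0' \\ 0' & x_{2t}' & \cdots & 0' \\ \vdots & \vdots & & \vdots \\ 0' & 0' & \cdots & x_{nt}' \end{bmatrix} \\ & \quad = \begin{bmatrix} \sigma^{11}x_{1t}x_{1t}' & \sigma^{12}x_{1t}x_{2t}' & \cdots & \sigma^{1n}x_{1t}x_{nt}' \\ \sigma^{21}x_{2t}x_{1t}' & \sigma^{22}x_{2t}x_{2t}' & \cdots & \sigma^{2n}x_{2t}x_{nt}' \\ \vdots & \vdots & & \vdots \\ \sigma^{n1}x_{nt}x_{1t}' & \sigma^{n2}x_{nt}x_{2t}' & \cdots & \sigma^{nn}x_{nt}x_{nt}' \end{bmatrix}, \end{aligned}$$ [11.3.33]

where $\sigma^{ij}$ denotes the row $i$, column $j$ element of $\Omega^{-1}$. Similarly,

$$\begin{aligned} & [(\tilde{x}_{1t}\tilde{y}_{1t}) + (\tilde{x}_{2t}\tilde{y}_{2t}) + \cdots + (\tilde{x}_{nt}\tilde{y}_{nt})] \\ & \quad = \begin{bmatrix} \tilde{x}_{1t} & \tilde{x}_{2t} & \cdots & \tilde{x}_{nt} \end{bmatrix}\begin{bmatrix} \tilde{y}_{1t} \\ \tilde{y}_{2t} \\ \vdots \\ \tilde{y}_{nt} \end{bmatrix} = \mathcal{X}_t L'Ly_t \\ & \quad = \begin{bmatrix} x_{1t} & 0 & \cdots & 0 \\ 0 & x_{2t} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & x_{nt} \end{bmatrix}\begin{bmatrix} \sigma^{11} & \sigma^{12} & \cdots & \sigma^{1n} \\ \sigma^{21} & \sigma^{22} & \cdots & \sigma^{2n} \\ \vdots & \vdots & & \vdots \\ \sigma^{n1} & \sigma^{n2} & \cdots & \sigma^{nn} \end{bmatrix}\begin{bmatrix} y_{1t} \\ y_{2t} \\ \vdots \\ y_{nt} \end{bmatrix} \\ & \quad = \begin{bmatrix} \sigma^{11}x_{1t}y_{1t} + \sigma^{12}x_{1t}y_{2t} + \cdots + \sigma^{1n}x_{1t}y_{nt} \\ \sigma^{21}x_{2t}y_{1t} + \sigma^{22}x_{2t}y_{2t} + \cdots + \sigma^{2n}x_{2t}y_{nt} \\ \vdots \\ \sigma^{n1}x_{nt}y_{1t} + \sigma^{n2}x_{nt}y_{2t} + \cdots + \sigma^{nn}x_{nt}y_{nt} \end{bmatrix}. \end{aligned}$$ [11.3.34]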




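Substituting [11.3.33] and [11.3.34] into [11.3.32], the MLE satisfies

$$\begin{aligned} \hat{\beta} = {} & \begin{bmatrix} \sigma^{11}\Sigma x_{1t}x_{1t}' & \sigma^{12}\Sigma x_{1t}x_{2t}' & \cdots & \sigma^{1n}\Sigma x_{1t}x_{nt}' \\ \sigma^{21}\Sigma x_{2t}x_{1t}' & \sigma^{22}\Sigma x_{2t}x_{2t}' & \cdots & \sigma^{2n}\Sigma x_{2t}x_{nt}' \\ \vdots & \vdots & & \vdots \\ \sigma^{n1}\Sigma x_{nt}x_{1t}' & \sigma^{n2}\Sigma x_{nt}x_{2t}' & \cdots & \sigma^{nn}\Sigma x_{nt}x_{nt}' \end{bmatrix}^{-1} \\ & \times \begin{bmatrix} \Sigma(\sigma^{11}x_{1t}y_{1t} + \sigma^{12}x_{1t}y_{2t} + \cdots + \sigma^{1n}x_{1t}y_{nt}) \\ \Sigma(\sigma^{21}x_{2t}y_{1t} + \sigma^{22}x_{2t}y_{2t} + \cdots + \sigma^{2n}x_{2t}y_{nt}) \\ \vdots \\ \Sigma(\sigma^{n1}x_{nt}y_{1t} + \sigma^{n2}x_{nt}y_{2t} + \cdots + \sigma^{nn}x_{nt}y_{nt}) \end{bmatrix}, \end{aligned}$$ [11.3.35]

where $\Sigma$ denotes summation over $t = 1, 2, \ldots, T$.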

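The result from Section 11.1 was that when there are no restrictions on the
VAR, maximum likelihood estimation is achieved by OLS equation by equation.
This result can be seen as a special case of [11.3.35] by setting $x_{1t} = x_{2t} = \cdots =
x_{nt} = x_t$, for then [11.3.35] becomes

$$\begin{aligned} \hat{\beta} &= [\Omega^{-1} \otimes (\Sigma x_t x_t')]^{-1}\Sigma[(\Omega^{-1}y_t) \otimes x_t] \\ &= [\Omega \otimes (\Sigma x_t x_t')^{-1}]\Sigma[(\Omega^{-1}y_t) \otimes x_t] \\ &= [I_n \otimes (\Sigma x_t x_t')^{-1}]\Sigma[y_t \otimes x_t] \\ &= \begin{bmatrix} (\Sigma x_t x_t')^{-1} & 0 & \cdots & 0 \\ 0 & (\Sigma x_t x_t')^{-1} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & (\Sigma x_t x_t')^{-1} \end{bmatrix}\begin{bmatrix} \Sigma y_{1t}x_t \\ \Sigma y_{2t}x_t \\ \vdots \\ \Sigma y_{nt}x_t \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix}, \end{aligned}$$

as shown directly in Section 11.1.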
Maximum likelihood estimation with constraints on both the coefficients and 
the variance-covariance matrix was discussed by Magnus (1978). 
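The reduction of [11.3.35] to equation-by-equation OLS when all equations share the same regressors can be checked numerically with Kronecker products. In the sketch below, the dimensions, the random data, and the particular positive definite matrix standing in for $\Omega$ are arbitrary assumptions made here; the result should not depend on them.

```python
import numpy as np

rng = np.random.default_rng(2)
T, n, k = 50, 3, 4
X = rng.standard_normal((T, k))          # rows are the common regressors x_t'
Y = rng.standard_normal((T, n))          # rows are y_t'
M = rng.standard_normal((n, n))
omega = M @ M.T + n * np.eye(n)          # any positive definite Omega
omega_inv = np.linalg.inv(omega)

# First line of the display: [Omega^{-1} (x) sum x x']^{-1} sum[(Omega^{-1} y_t) (x) x_t]
XtX = X.T @ X
lhs = np.kron(omega_inv, XtX)
rhs = sum(np.kron(omega_inv @ Y[t], X[t]) for t in range(T))
beta_gls = np.linalg.solve(lhs, rhs)

# Last line: the OLS coefficient vectors b_1, ..., b_n stacked.
B_ols = np.linalg.solve(XtX, X.T @ Y)    # column i is b_i
beta_ols = B_ols.T.reshape(-1)

print(np.max(np.abs(beta_gls - beta_ols)))
```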


11.4. The Impulse-Response Function 


In equation [10.1.15] a VAR was written in vector MA($\infty$) form as

$$y_t = \mu + \varepsilon_t + \Psi_1\varepsilon_{t-1} + \Psi_2\varepsilon_{t-2} + \cdots.$$ [11.4.1]

Thus, the matrix $\Psi_s$ has the interpretation

$$\frac{\partial y_{t+s}}{\partial \varepsilon_t'} = \Psi_s;$$ [11.4.2]

that is, the row $i$, column $j$ element of $\Psi_s$ identifies the consequences of a one-
unit increase in the $j$th variable's innovation at date $t$ ($\varepsilon_{jt}$) for the value of the $i$th
variable at time $t + s$ ($y_{i,t+s}$), holding all other innovations at all dates constant.
If we were told that the first element of $\varepsilon_t$ changed by $\delta_1$ at the same time
that the second element changed by $\delta_2$, ..., and the $n$th element by $\delta_n$, then the
combined effect of these changes on the value of the vector $y_{t+s}$ would be given
by

$$\Delta y_{t+s} = \frac{\partial y_{t+s}}{\partial \varepsilon_{1t}}\delta_1 + \frac{\partial y_{t+s}}{\partial \varepsilon_{2t}}\delta_2 + \cdots + \frac{\partial y_{t+s}}{\partial \varepsilon_{nt}}\delta_n = \Psi_s\delta,$$ [11.4.3]

where $\delta = (\delta_1, \delta_2, \ldots, \delta_n)'$.

Several analytic characterizations of $\Psi_s$ were given in Section 10.1. A simple
way to find these dynamic multipliers numerically is by simulation. To implement
the simulation, set $y_{t-1} = y_{t-2} = \cdots = y_{t-p} = 0$. Set $\varepsilon_{jt} = 1$ and all other elements
of $\varepsilon_t$ to zero, and simulate the system [11.1.1] for dates $t, t + 1, t + 2, \ldots$, with
$c$ and $\varepsilon_{t+1}, \varepsilon_{t+2}, \ldots$ all zero. The value of the vector $y_{t+s}$ at date $t + s$ of this
simulation corresponds to the $j$th column of the matrix $\Psi_s$. By doing a separate
simulation for impulses to each of the innovations ($j = 1, 2, \ldots, n$), all of the
columns of $\Psi_s$ can be calculated.
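The simulation described above can be sketched as follows. The code assumes that [11.1.1] has the usual VAR(p) form $y_t = c + \Phi_1 y_{t-1} + \cdots + \Phi_p y_{t-p} + \varepsilon_t$; the function name `impulse_responses` and the example coefficient matrix are hypothetical.

```python
import numpy as np

def impulse_responses(Phi, horizon):
    """Phi: list of the p (n x n) autoregressive matrices [Phi_1, ..., Phi_p].
    Returns an array Psi with Psi[s] equal to the (n x n) matrix Psi_s."""
    p, n = len(Phi), Phi[0].shape[0]
    Psi = np.zeros((horizon + 1, n, n))
    for j in range(n):
        # history[0] is the most recent value; start from y_{t-1} = ... = y_{t-p} = 0.
        history = [np.zeros(n) for _ in range(p)]
        for s in range(horizon + 1):
            eps = np.zeros(n)
            if s == 0:
                eps[j] = 1.0                  # unit impulse to innovation j at date t
            # c is set to zero, as are all innovations after date t.
            y = sum(Phi[k] @ history[k] for k in range(p)) + eps
            Psi[s, :, j] = y                  # jth column of Psi_s
            history = [y] + history[:-1]
    return Psi

# Illustrative first-order system, for which Psi_s is the sth power of Phi_1.
Phi1 = np.array([[0.5, 0.1],
                 [0.2, 0.3]])
Psi = impulse_responses([Phi1], horizon=5)
print(Psi[3])
```

For a first-order VAR the simulated multipliers can be compared directly with matrix powers of the coefficient matrix.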

A plot of the row $i$, column $j$ element of $\Psi_s$,

$$\frac{\partial y_{i,t+s}}{\partial \varepsilon_{jt}},$$ [11.4.4]

as a function of $s$ is called the impulse-response function. It describes the response
of $y_{i,t+s}$ to a one-time impulse in $y_{jt}$ with all other variables dated $t$ or earlier held
constant.

Is there a sense in which this multiplier can be viewed as measuring the causal
effect of $y_j$ on $y_i$? The discussion of Granger-causality tests suggests that we should
be wary of such a claim. We are on surer ground with an atheoretical VAR if we
confine ourselves to statements about forecasts. Consider, therefore, the following
question. Let

$$x_{t-1}' = (y_{t-1}', y_{t-2}', \ldots, y_{t-p}')$$

denote the information received about the system as of date $t - 1$. Suppose we
are then told that the date $t$ value of the first variable in the autoregression, $y_{1t}$,
was higher than expected, so that $\varepsilon_{1t}$ is positive. How does this cause us to revise
our forecast of $y_{i,t+s}$? In other words, what is

$$\frac{\partial \hat{E}(y_{i,t+s} | y_{1t}, x_{t-1})}{\partial y_{1t}}?$$ [11.4.5]

The answer to this question is given by [11.4.4] with $j = 1$ only in the special case
when $E(\varepsilon_t\varepsilon_t') = \Omega$ is a diagonal matrix. In the more general case when the elements
of $\varepsilon_t$ are contemporaneously correlated with one another, the fact that $\varepsilon_{1t}$ is positive
gives us some useful new information about the values of $\varepsilon_{2t}, \ldots, \varepsilon_{nt}$. This in-
formation has further implications for the value of $y_{i,t+s}$. To summarize these
implications, we need to calculate the vector

$$\frac{\partial \hat{E}(\varepsilon_t | y_{1t}, x_{t-1})}{\partial y_{1t}}$$

and then use [11.4.3] to calculate the effect of this change in all the elements of
$\varepsilon_t$ on the value of $y_{i,t+s}$.

Yet another magnitude we might propose to measure is the forecast revision
resulting from new information about, say, the second variable, $y_{2t}$, beyond that
contained in the first variable, $y_{1t}$. Thus, we might calculate

$$\frac{\partial \hat{E}(y_{i,t+s} | y_{2t}, y_{1t}, x_{t-1})}{\partial y_{2t}}.$$ [11.4.6]



Similarly, for the variable designated number 3, we might seek

$$\frac{\partial \hat{E}(y_{i,t+s} | y_{3t}, y_{2t}, y_{1t}, x_{t-1})}{\partial y_{3t}},$$ [11.4.7]

and for variable $n$,

$$\frac{\partial \hat{E}(y_{i,t+s} | y_{nt}, y_{n-1,t}, \ldots, y_{1t}, x_{t-1})}{\partial y_{nt}}.$$ [11.4.8]

This last magnitude corresponds to the effect of $\varepsilon_{nt}$ with $\varepsilon_{1t}, \ldots, \varepsilon_{n-1,t}$ constant
and is given simply by the row $i$, column $n$ element of $\Psi_s$.

The recursive information ordering in [11.4.5] through [11.4.8] is quite com-
monly used. For this ordering, the indicated multipliers can be calculated from the
moving average coefficients ($\Psi_s$) and the variance-covariance matrix of $\varepsilon_t$ ($\Omega$) by
a simple algorithm. Recall from Section 4.4 that for any real symmetric positive
definite matrix $\Omega$, there exists a unique lower triangular matrix $A$ with 1s along
the principal diagonal and a unique diagonal matrix $D$ with positive entries along
the principal diagonal such that

$$\Omega = ADA'.$$ [11.4.9]

Using this matrix $A$ we can construct an ($n \times 1$) vector $u_t$ from

$$u_t = A^{-1}\varepsilon_t.$$ [11.4.10]


Notice that since $\varepsilon_t$ is uncorrelated with its own lags or with lagged values of $y$, it
follows that $u_t$ is also uncorrelated with its own lags or with lagged values of $y$.
The elements of $u_t$ are furthermore uncorrelated with each other:

$$\begin{aligned}
E(u_t u_t') &= [A^{-1}]\,E(\varepsilon_t \varepsilon_t')\,[A^{-1}]' \\
&= [A^{-1}]\,\Omega\,[A']^{-1} \\
&= [A^{-1}]\,ADA'\,[A']^{-1} \\
&= D.
\end{aligned} \qquad [11.4.11]$$

But $D$ is a diagonal matrix, verifying that the elements of $u_t$ are mutually
uncorrelated. The $(j, j)$ element of $D$ gives the variance of $u_{jt}$.
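
The factorization in [11.4.9] and the orthogonalization in [11.4.10] can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names `triangular_factor` and `orthogonalize` and the numbers in `omega` are hypothetical, and the recursion is the standard column-by-column triangular factorization rather than a transcription of the Section 4.4 algorithm.

```python
def triangular_factor(omega):
    """Factor a symmetric positive definite matrix as omega = A D A'.

    Returns (A, d), where A is unit lower triangular (a list of rows)
    and d holds the diagonal elements of D.
    """
    n = len(omega)
    A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    d = [0.0] * n
    for j in range(n):
        # d_jj is what is left of omega_jj after removing earlier columns.
        d[j] = omega[j][j] - sum(A[j][k] ** 2 * d[k] for k in range(j))
        for i in range(j + 1, n):
            partial = sum(A[i][k] * A[j][k] * d[k] for k in range(j))
            A[i][j] = (omega[i][j] - partial) / d[j]
    return A, d


def orthogonalize(A, eps):
    """Solve A u = eps by forward substitution, i.e. u = A^{-1} eps."""
    u = []
    for j, e in enumerate(eps):
        # u_j = eps_j - a_j1 u_1 - ... - a_j,j-1 u_j-1
        u.append(e - sum(A[j][k] * u[k] for k in range(j)))
    return u


# Hypothetical innovation covariance matrix (illustrative numbers only).
omega = [[4.0, 2.0, 1.0],
         [2.0, 5.0, 3.0],
         [1.0, 3.0, 6.0]]
A, d = triangular_factor(omega)

# Rebuild A D A' to confirm that it reproduces omega.
n = len(omega)
rebuilt = [[sum(A[i][k] * d[k] * A[j][k] for k in range(n)) for j in range(n)]
           for i in range(n)]

u = orthogonalize(A, [1.0, 1.0, 1.0])
print(A)
print(d)
print(u)
```

Forward substitution is used instead of forming $A^{-1}$ explicitly because $A$ is unit lower triangular, so each $u_{jt}$ depends only on $\varepsilon_{jt}$ and on the elements of $u_t$ already computed.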
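If both sides of [11.4.10] are premultiplied by $A$, the result is

$$Au_t = \varepsilon_t. \qquad [11.4.12]$$

Writing out the equations represented by [11.4.12] explicitly,

$$\begin{bmatrix}
1 & 0 & 0 & \cdots & 0 \\
a_{21} & 1 & 0 & \cdots & 0 \\
a_{31} & a_{32} & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
a_{n1} & a_{n2} & a_{n3} & \cdots & 1
\end{bmatrix}
\begin{bmatrix} u_{1t} \\ u_{2t} \\ u_{3t} \\ \vdots \\ u_{nt} \end{bmatrix}
=
\begin{bmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \\ \varepsilon_{3t} \\ \vdots \\ \varepsilon_{nt} \end{bmatrix}. \qquad [11.4.13]$$

Thus, $u_{1t}$ is simply $\varepsilon_{1t}$. The $j$th row of [11.4.13] states that

$$u_{jt} = \varepsilon_{jt} - a_{j1}u_{1t} - a_{j2}u_{2t} - \cdots - a_{j,j-1}u_{j-1,t}.$$

But since $u_{jt}$ is uncorrelated with $u_{1t}, u_{2t}, \ldots, u_{j-1,t}$, it follows that $u_{jt}$ has the
interpretation as the residual from a projection of $\varepsilon_{jt}$ on $u_{1t}, u_{2t}, \ldots, u_{j-1,t}$:

$$\hat{E}(\varepsilon_{jt} \mid u_{1t}, u_{2t}, \ldots, u_{j-1,t}) = a_{j1}u_{1t} + a_{j2}u_{2t} + \cdots + a_{j,j-1}u_{j-1,t}. \qquad [11.4.14]$$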


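The fact that the $u_{jt}$ are uncorrelated further implies that the coefficient on
$u_{1t}$ in a projection of $\varepsilon_{jt}$ on $(u_{1t}, u_{2t}, \ldots, u_{j-1,t})$ is the same as the coefficient on
$u_{1t}$ in a projection of $\varepsilon_{jt}$ on $u_{1t}$ alone:

$$\hat{E}(\varepsilon_{jt} \mid u_{1t}) = a_{j1}u_{1t}. \qquad [11.4.15]$$

Recalling from [11.4.13] that $\varepsilon_{1t} = u_{1t}$, we see that new information about the
value of $\varepsilon_{1t}$ would cause us to revise our forecast of $\varepsilon_{jt}$ by the amount

$$\frac{\partial \hat{E}(\varepsilon_{jt} \mid \varepsilon_{1t})}{\partial \varepsilon_{1t}} = \frac{\partial \hat{E}(\varepsilon_{jt} \mid u_{1t})}{\partial u_{1t}} = a_{j1}. \qquad [11.4.16]$$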


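Now $\varepsilon_{1t}$ has the interpretation as $y_{1t} - \hat{E}(y_{1t} \mid x_{t-1})$ and $\varepsilon_{jt}$ has the interpretation
as $y_{jt} - \hat{E}(y_{jt} \mid x_{t-1})$. From the formula for updating a linear projection [4.5.14],
the coefficient on $y_{1t}$ in a linear projection of $y_{jt}$ on $y_{1t}$ and $x_{t-1}$ is the same as the
coefficient on $\varepsilon_{1t}$ in a linear projection of $\varepsilon_{jt}$ on $\varepsilon_{1t}$.$^{11}$ Hence,

$$\frac{\partial \hat{E}(\varepsilon_{jt} \mid y_{1t}, x_{t-1})}{\partial y_{1t}} = a_{j1}. \qquad [11.4.17]$$

Combining these equations for $j = 1, 2, \ldots, n$ into a vector,

$$\frac{\partial \hat{E}(\varepsilon_t \mid y_{1t}, x_{t-1})}{\partial y_{1t}} = a_1, \qquad [11.4.18]$$

where $a_1$ denotes the first column of $A$:

$$a_1 \equiv \begin{bmatrix} 1 \\ a_{21} \\ a_{31} \\ \vdots \\ a_{n1} \end{bmatrix}.$$

Substituting [11.4.18] into [11.4.3], the consequences for $y_{t+s}$ of new information
about $y_{1t}$ beyond that contained in $x_{t-1}$ are given by

$$\frac{\partial \hat{E}(y_{t+s} \mid y_{1t}, x_{t-1})}{\partial y_{1t}} = \Psi_s a_1.$$

$^{11}$That is,

$$\begin{aligned}
\hat{E}(y_{jt} \mid y_{1t}, x_{t-1}) ={}& \hat{E}(y_{jt} \mid x_{t-1}) \\
&+ \text{Cov}\{[y_{jt} - \hat{E}(y_{jt} \mid x_{t-1})],\, [y_{1t} - \hat{E}(y_{1t} \mid x_{t-1})]\} \\
&\quad \times \{\text{Var}[y_{1t} - \hat{E}(y_{1t} \mid x_{t-1})]\}^{-1}\,[y_{1t} - \hat{E}(y_{1t} \mid x_{t-1})] \\
={}& \hat{E}(y_{jt} \mid x_{t-1}) + \text{Cov}(\varepsilon_{jt}, \varepsilon_{1t}) \cdot \{\text{Var}(\varepsilon_{1t})\}^{-1} \cdot \varepsilon_{1t}.
\end{aligned}$$

Similarly, the variable $u_{2t}$ represents the new information in $y_{2t}$ beyond that
contained in $(y_{1t}, x_{t-1})$. This information would, of course, not cause us to change
our assessment of $\varepsilon_{1t}$ (which we know with certainty from $y_{1t}$ and $x_{t-1}$), but from
[11.4.14] would cause us to revise our estimate of $\varepsilon_{jt}$ for $j = 2, 3, \ldots, n$ by

$$\frac{\partial \hat{E}(\varepsilon_{jt} \mid u_{2t}, u_{1t})}{\partial u_{2t}} = a_{j2}.$$

Substituting this into [11.4.3], we conclude that

$$\frac{\partial \hat{E}(y_{t+s} \mid y_{2t}, y_{1t}, x_{t-1})}{\partial y_{2t}} = \Psi_s a_2,$$

where

$$a_2 \equiv \begin{bmatrix} 0 \\ 1 \\ a_{32} \\ \vdots \\ a_{n2} \end{bmatrix}.$$

In general,

$$\frac{\partial \hat{E}(y_{t+s} \mid y_{jt}, y_{j-1,t}, \ldots, y_{1t}, x_{t-1})}{\partial y_{jt}} = \Psi_s a_j, \qquad [11.4.19]$$

where $a_j$ denotes the $j$th column of the matrix $A$ defined in [11.4.9].

The magnitude in [11.4.19] is a population moment, constructed from the
population parameters $\Psi_s$ and $\Omega$ using [11.4.9]. For a given observed sample of
size $T$, we would estimate the autoregressive coefficients $\Phi_1, \ldots, \Phi_p$ by OLS and
construct $\hat{\Psi}_s$ by simulating the estimated system. OLS estimation would also
provide the estimate $\hat{\Omega} = (1/T)\sum_{t=1}^{T} \hat{\varepsilon}_t \hat{\varepsilon}_t'$, where the $i$th element of $\hat{\varepsilon}_t$ is the OLS
sample residual for the $i$th equation in the VAR for date $t$. Matrices $\hat{A}$ and $\hat{D}$
satisfying $\hat{\Omega} = \hat{A}\hat{D}\hat{A}'$ could then be constructed from $\hat{\Omega}$ using the algorithm
described in Section 4.4. Notice that the elements of the vector $\hat{u}_t = \hat{A}^{-1}\hat{\varepsilon}_t$ are then
mutually orthogonal by construction:

$$(1/T)\sum_{t=1}^{T} \hat{u}_t \hat{u}_t' = (1/T)\sum_{t=1}^{T} \hat{A}^{-1}\hat{\varepsilon}_t \hat{\varepsilon}_t'(\hat{A}^{-1})' = \hat{A}^{-1}\hat{\Omega}(\hat{A}^{-1})' = \hat{D}.$$

The sample estimate of [11.4.19] is then

$$\hat{\Psi}_s \hat{a}_j, \qquad [11.4.20]$$

where $\hat{a}_j$ denotes the $j$th column of the matrix $\hat{A}$.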

A plot of [11.4.20] as a function of $s$ is known as an orthogonalized
impulse-response function. It is based on decomposing the original VAR innovations
$(\varepsilon_{1t}, \ldots, \varepsilon_{nt})$ into a set of uncorrelated components $(u_{1t}, \ldots, u_{nt})$ and calculating
the consequences for $y_{t+s}$ of a unit impulse in $u_{jt}$. These multipliers describe how
new information about $y_{jt}$ causes us to revise our forecast of $y_{t+s}$, though the implicit
definition of "new" information is different for each variable $j$.
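
As an illustration of how the orthogonalized multipliers $\Psi_s a_j$ might be tabulated, the following sketch assumes a bivariate VAR(1), for which $\Psi_s = \Phi_1^s$. The coefficient matrix `phi`, the covariance matrix `omega`, and the function names are hypothetical choices made for the example, not estimates from any data set.

```python
def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def ma_matrices(phi, horizon):
    """Psi_0, ..., Psi_horizon for a VAR(1), where Psi_s = phi^s."""
    n = len(phi)
    psi = [[[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]]
    for _ in range(horizon):
        psi.append(matmul(phi, psi[-1]))
    return psi


def orthogonalized_irf(psi, A, j):
    """Vectors Psi_s a_j for s = 0, 1, ..., with j counted from zero."""
    a_j = [row[j] for row in A]
    return [[sum(P[i][k] * a_j[k] for k in range(len(a_j)))
             for i in range(len(P))] for P in psi]


# Hypothetical VAR(1) coefficients and innovation covariance matrix.
phi = [[0.5, 0.1],
       [0.2, 0.3]]
omega = [[1.0, 0.5],
         [0.5, 2.0]]

# For n = 2 the only free element of A is a_21 = omega_21 / omega_11.
A = [[1.0, 0.0],
     [omega[1][0] / omega[0][0], 1.0]]

psi = ma_matrices(phi, 8)
irf_1 = orthogonalized_irf(psi, A, 0)  # response to news in the first variable
irf_2 = orthogonalized_irf(psi, A, 1)  # response to news in the second variable

for s, (r1, r2) in enumerate(zip(irf_1, irf_2)):
    print(s, r1, r2)
```

Reordering the two variables would change `A` and hence both response paths, which is the sense in which the definition of new information differs across variables.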

What is the rationale for treating each variable differently? Clearly, if the 
VAR is being used as a purely atheoretical summary of the dynamics of a group
of variables, there can be none: we could just as easily have labeled the second
variable $y_{1t}$ and the first variable $y_{2t}$, in which case we would have obtained different
dynamic multipliers. By choosing a particular recursive ordering of the variables, 
the researcher is implicitly asking a set of questions about forecasting of the form 
of [11.4.5] through [11.4.8]. Whether we should orthogonalize in this way and how 
the variables should be ordered would seem to depend on why we want to ask 
such questions about forecasting in the first place. We will explore this issue in 
more depth in Section 11.6. 

Before leaving the recursive orthogonalization, we note another popular form
in which it is implemented and reported. Recall that $D$ is a diagonal matrix whose
$(j, j)$ element is the variance of $u_{jt}$. Let $D^{1/2}$ denote the diagonal matrix whose
$(j, j)$ element is the standard deviation of $u_{jt}$. Note that [11.4.9] could be written as

$$\Omega = AD^{1/2}D^{1/2}A' = PP', \qquad [11.4.21]$$

where

$$P \equiv AD^{1/2}.$$

Expression [11.4.21] is the Cholesky decomposition of the matrix $\Omega$. Note that,
like $A$, the $(n \times n)$ matrix $P$ is lower triangular, though whereas $A$ has 1s along
its principal diagonal, $P$ has the standard deviation of $u_t$ along its principal diagonal.
In place of $u_t$ defined in [11.4.10], some researchers use

$$v_t \equiv P^{-1}\varepsilon_t = D^{-1/2}A^{-1}\varepsilon_t = D^{-1/2}u_t.$$

Thus, $v_{jt}$ is just $u_{jt}$ divided by its standard deviation $\sqrt{d_{jj}}$. A one-unit increase in
$v_{jt}$ is the same as a one-standard-deviation increase in $u_{jt}$.
In place of the dynamic multiplier $\partial y_{i,t+s}/\partial u_{jt}$, these researchers then report
$\partial y_{i,t+s}/\partial v_{jt}$. The relation between these multipliers is clearly

$$\frac{\partial y_{t+s}}{\partial v_{jt}} = \frac{\partial y_{t+s}}{\partial u_{jt}} \sqrt{d_{jj}} = \Psi_s a_j \sqrt{d_{jj}}.$$

But $a_j\sqrt{d_{jj}}$ is just the $j$th column of $AD^{1/2}$, which is the $j$th column of the Cholesky
factor matrix $P$. Denoting the $j$th column of $P$ by $p_j$, we have

$$\frac{\partial y_{t+s}}{\partial v_{jt}} = \Psi_s p_j. \qquad [11.4.22]$$

Expression [11.4.22] is just [11.4.19] multiplied by the constant $\sqrt{\text{Var}(u_{jt})}$.
Expression [11.4.19] gives the consequences of a one-unit increase in $y_{jt}$, where
the units are those in which $y_{jt}$ itself is measured. Expression [11.4.22] gives the
consequences if $y_{jt}$ were to increase by $\sqrt{\text{Var}(u_{jt})}$ units.
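
The relation between the two scalings can be checked numerically. The sketch below assumes the same kind of hypothetical bivariate example as before (the numbers in `omega` and `phi` are illustrative, and `phi` plays the role of $\Psi_1$ for a VAR(1)); it builds $P = AD^{1/2}$ and compares $\Psi_1 p_2$ with $\Psi_1 a_2 \sqrt{d_{22}}$.

```python
import math


def apply(M, v):
    """Matrix-vector product with M stored as a list of rows."""
    return [sum(M[i][k] * v[k] for k in range(len(v))) for i in range(len(M))]


# Hypothetical innovation covariance matrix.
omega = [[1.0, 0.5],
         [0.5, 2.0]]

# Triangular factorization omega = A D A' written out by hand for n = 2.
a21 = omega[1][0] / omega[0][0]
A = [[1.0, 0.0],
     [a21, 1.0]]
d = [omega[0][0], omega[1][1] - a21 ** 2 * omega[0][0]]

# Cholesky factor P = A D^{1/2}: column j of A times the std. dev. of u_j.
P = [[A[i][j] * math.sqrt(d[j]) for j in range(2)] for i in range(2)]
PPt = [[sum(P[i][k] * P[j][k] for k in range(2)) for j in range(2)]
       for i in range(2)]

# Psi_1 for a hypothetical VAR(1).
phi = [[0.5, 0.1],
       [0.2, 0.3]]
a_2 = [A[0][1], A[1][1]]
p_2 = [P[0][1], P[1][1]]
unit_response = apply(phi, a_2)    # Psi_1 a_2: one-unit impulse in u_2
scaled_response = apply(phi, p_2)  # Psi_1 p_2: one-std.-dev. impulse in u_2

print(P)
print(unit_response, scaled_response)
```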


11.5. Variance Decomposition 


Equations [10.1.14] and [10.1.16] identify the error in forecasting a VAR $s$ periods
into the future as

$$y_{t+s} - \hat{y}_{t+s|t} = \varepsilon_{t+s} + \Psi_1\varepsilon_{t+s-1} + \Psi_2\varepsilon_{t+s-2} + \cdots + \Psi_{s-1}\varepsilon_{t+1}. \qquad [11.5.1]$$

The mean squared error of this $s$-period-ahead forecast is thus

$$\begin{aligned}
\text{MSE}(\hat{y}_{t+s|t}) &= E[(y_{t+s} - \hat{y}_{t+s|t})(y_{t+s} - \hat{y}_{t+s|t})'] \\
&= \Omega + \Psi_1\Omega\Psi_1' + \Psi_2\Omega\Psi_2' + \cdots + \Psi_{s-1}\Omega\Psi_{s-1}',
\end{aligned} \qquad [11.5.2]$$

where

$$\Omega = E(\varepsilon_t\varepsilon_t'). \qquad [11.5.3]$$

Let us now consider how each of the orthogonalized disturbances $(u_{1t}, \ldots, u_{nt})$
contributes to this MSE. Write [11.4.12] as

$$\varepsilon_t = Au_t = a_1u_{1t} + a_2u_{2t} + \cdots + a_nu_{nt}, \qquad [11.5.4]$$

where, as before, $a_j$ denotes the $j$th column of the matrix $A$ given in [11.4.9].
Recalling that the $u_{jt}$'s are uncorrelated, postmultiplying equation [11.5.4] by its
transpose and taking expectations produces

$$\begin{aligned}
\Omega &= E(\varepsilon_t\varepsilon_t') \\
&= a_1a_1' \cdot \text{Var}(u_{1t}) + a_2a_2' \cdot \text{Var}(u_{2t}) + \cdots + a_na_n' \cdot \text{Var}(u_{nt}),
\end{aligned} \qquad [11.5.5]$$

where $\text{Var}(u_{jt})$ is the row $j$, column $j$ element of the matrix $D$ in [11.4.9]. Substituting
[11.5.5] into [11.5.2], the MSE of the $s$-period-ahead forecast can be written as the
sum of $n$ terms, one arising from each of the disturbances $u_{jt}$:

$$\text{MSE}(\hat{y}_{t+s|t}) = \sum_{j=1}^{n} \{\text{Var}(u_{jt}) \cdot [a_ja_j' + \Psi_1a_ja_j'\Psi_1' + \Psi_2a_ja_j'\Psi_2' + \cdots + \Psi_{s-1}a_ja_j'\Psi_{s-1}']\}. \qquad [11.5.6]$$

With this expression, we can calculate the contribution of the $j$th orthogonalized
innovation to the MSE of the $s$-period-ahead forecast:

$$\text{Var}(u_{jt}) \cdot [a_ja_j' + \Psi_1a_ja_j'\Psi_1' + \Psi_2a_ja_j'\Psi_2' + \cdots + \Psi_{s-1}a_ja_j'\Psi_{s-1}'].$$

Again, this magnitude in general depends on the ordering of the variables.

As $s \to \infty$ for a covariance-stationary VAR, $\text{MSE}(\hat{y}_{t+s|t}) \to \Gamma_0$, the
unconditional variance of the vector $y_t$. Thus, [11.5.6] permits calculation of the portion
of the total variance of $y_i$ that is due to the disturbance $u_j$ by letting $s$ become
suitably large.

Alternatively, recalling that $a_j \cdot \sqrt{\text{Var}(u_{jt})}$ is equal to $p_j$, the $j$th column of
the Cholesky factor $P$, result [11.5.6] can equivalently be written as

$$\text{MSE}(\hat{y}_{t+s|t}) = \sum_{j=1}^{n} [p_jp_j' + \Psi_1p_jp_j'\Psi_1' + \Psi_2p_jp_j'\Psi_2' + \cdots + \Psi_{s-1}p_jp_j'\Psi_{s-1}']. \qquad [11.5.7]$$
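
A small numerical sketch may help fix ideas. It assumes a hypothetical bivariate VAR(1) (so that $\Psi_k = \Phi_1^k$), accumulates both the total MSE in [11.5.2] and the $n$ separate contributions in [11.5.6], and reports each contribution as a share of the forecast error variance of each variable. All numbers and the name `variance_decomposition` are illustrative assumptions.

```python
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def transpose(X):
    return [list(row) for row in zip(*X)]


def variance_decomposition(phi, omega, s):
    """Return (mse, contrib) for an s-period-ahead forecast of a 2-variable VAR(1).

    mse is the total in [11.5.2]; contrib[j] is the j-th term of [11.5.6].
    """
    n = 2
    a21 = omega[1][0] / omega[0][0]
    A = [[1.0, 0.0], [a21, 1.0]]
    d = [omega[0][0], omega[1][1] - a21 ** 2 * omega[0][0]]
    psi = [[1.0, 0.0], [0.0, 1.0]]  # Psi_0 = I
    mse = [[0.0] * n for _ in range(n)]
    contrib = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for _ in range(s):
        term = matmul(matmul(psi, omega), transpose(psi))  # Psi_k Omega Psi_k'
        for r in range(n):
            for c in range(n):
                mse[r][c] += term[r][c]
        for j in range(n):
            # v = Psi_k a_j, so v v' = Psi_k a_j a_j' Psi_k'
            v = [sum(psi[i][k] * A[k][j] for k in range(n)) for i in range(n)]
            for r in range(n):
                for c in range(n):
                    contrib[j][r][c] += d[j] * v[r] * v[c]
        psi = matmul(phi, psi)  # move on to Psi_{k+1}
    return mse, contrib


phi = [[0.5, 0.1], [0.2, 0.3]]     # hypothetical VAR(1) coefficients
omega = [[1.0, 0.5], [0.5, 2.0]]   # hypothetical innovation covariance

mse1, contrib1 = variance_decomposition(phi, omega, 1)
mse4, contrib4 = variance_decomposition(phi, omega, 4)

# shares[i][j]: fraction of the MSE for variable i attributed to u_j.
shares1 = [[contrib1[j][i][i] / mse1[i][i] for j in range(2)] for i in range(2)]
shares4 = [[contrib4[j][i][i] / mse4[i][i] for j in range(2)] for i in range(2)]
print(shares1)
print(shares4)
```

With the first variable ordered first, all of its one-period-ahead forecast error is attributed to $u_{1t}$ by construction; swapping the ordering would reassign the shares.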


11.6. Vector Autoregressions and Structural 
Econometric Models 


Pitfalls in Estimating Dynamic Structural Models 


The vector autoregression was introduced in Section 10.1 as a statistical de- 
scription of the dynamic interrelations between n different variables contained in 
the vector $y_t$. This description made no use of prior theoretical ideas about how
these variables are expected to be related, and therefore cannot be used to test 
our theories or interpret the data in terms of economic principles. This section 
explores the relation between VARs and structural econometric models. 

Suppose that we would like to estimate a money demand function that ex- 
presses the public’s willingness to hold cash as a function of the level of income 
and interest rates. The following specification was used by some early researchers: 


$$M_t - P_t = \beta_0 + \beta_1Y_t + \beta_2I_t + \beta_3(M_{t-1} - P_{t-1}) + v_t^D. \qquad [11.6.1]$$

Here, $M_t$ is the log of the nominal money balances held by the public at date $t$, $P_t$
is the log of the aggregate price level, $Y_t$ is the log of real GNP, and $I_t$ is a nominal
interest rate. The parameters $\beta_1$ and $\beta_2$ represent the effect of income and interest
rates on desired cash holdings. Part of the adjustment in money balances to a
change in income is thought to take place immediately, with further adjustments
coming in subsequent periods. The parameter $\beta_3$ characterizes this partial
adjustment. The disturbance $v_t^D$ represents factors other than income and interest rates
that influence money demand.

It was once common practice to estimate such a money demand equation
with Cochrane-Orcutt adjustment for first-order serial correlation. The implicit
assumption behind this procedure is that

$$v_t^D = \rho v_{t-1}^D + u_t^D, \qquad [11.6.2]$$

where $u_t^D$ is white noise. Write equation [11.6.2] as $(1 - \rho L)v_t^D = u_t^D$ and multiply
both sides of [11.6.1] by $(1 - \rho L)$:

$$\begin{aligned}
M_t - P_t ={}& (1 - \rho)\beta_0 + \beta_1Y_t - \beta_1\rho Y_{t-1} + \beta_2I_t - \beta_2\rho I_{t-1} \\
&+ (\beta_3 + \rho)(M_{t-1} - P_{t-1}) - \beta_3\rho(M_{t-2} - P_{t-2}) + u_t^D.
\end{aligned} \qquad [11.6.3]$$


Equation [11.6.3] is a restricted version of

$$\begin{aligned}
M_t - P_t ={}& \alpha_0 + \alpha_1Y_t + \alpha_2Y_{t-1} + \alpha_3I_t + \alpha_4I_{t-1} \\
&+ \alpha_5(M_{t-1} - P_{t-1}) + \alpha_6(M_{t-2} - P_{t-2}) + u_t^D,
\end{aligned} \qquad [11.6.4]$$

where the seven parameters $(\alpha_0, \alpha_1, \ldots, \alpha_6)$ are restricted in [11.6.3] to be
nonlinear functions of the underlying five parameters $(\rho, \beta_0, \beta_1, \beta_2, \beta_3)$. The
assumption of [11.6.2] can thus be tested by comparing the fit of [11.6.3] with that
from unconstrained estimation of [11.6.4].
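
To make the restrictions concrete, the sketch below maps the five underlying parameters into the seven coefficients of [11.6.4] and then evaluates two functions of the $\alpha$'s that are zero whenever the mapping in [11.6.3] holds. The function names and parameter values are hypothetical, and the two "gaps" are one convenient way of writing the two implied restrictions, not the form of any particular test statistic.

```python
def restricted_alphas(rho, b0, b1, b2, b3):
    """Coefficients (alpha_0, ..., alpha_6) of [11.6.4] implied by [11.6.3]."""
    return [
        (1 - rho) * b0,  # alpha_0: constant
        b1,              # alpha_1: Y_t
        -b1 * rho,       # alpha_2: Y_{t-1}
        b2,              # alpha_3: I_t
        -b2 * rho,       # alpha_4: I_{t-1}
        b3 + rho,        # alpha_5: M_{t-1} - P_{t-1}
        -b3 * rho,       # alpha_6: M_{t-2} - P_{t-2}
    ]


def restriction_gaps(alpha):
    """Two quantities that equal zero when the alphas satisfy [11.6.3].

    Seven unrestricted coefficients minus five underlying parameters
    leaves two restrictions.
    """
    rho_from_income = -alpha[2] / alpha[1]
    rho_from_interest = -alpha[4] / alpha[3]
    gap_common_rho = rho_from_income - rho_from_interest
    # alpha_5 = beta_3 + rho and alpha_6 = -beta_3 * rho together imply
    # rho * alpha_5 - rho**2 + alpha_6 = 0.
    gap_adjustment = rho_from_income * alpha[5] - rho_from_income ** 2 + alpha[6]
    return gap_common_rho, gap_adjustment


# Hypothetical parameter values, chosen only for illustration.
alphas = restricted_alphas(rho=0.6, b0=1.0, b1=0.8, b2=-0.5, b3=0.7)
gaps = restriction_gaps(alphas)

# An arbitrary unrestricted coefficient vector generally violates both.
unrestricted = [0.4, 0.8, -0.3, -0.5, 0.3, 1.3, -0.42]
gaps_unrestricted = restriction_gaps(unrestricted)
print(alphas, gaps, gaps_unrestricted)
```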

By definition, $v_t^D$ represents factors influencing money demand for which the
researcher has no explicit theory. It therefore seems odd to place great confidence
in a detailed specification of its dynamics such as [11.6.2] without testing this
assumption against the data. For example, there do not seem to be clear theoretical
grounds for ruling out a specification such as

$$v_t^D = \rho_1v_{t-1}^D + \rho_2v_{t-2}^D + u_t^D,$$

or, for that matter, a specification in which $v_t^D$ is correlated with lagged values of
$Y$ or $I$.

Equation [11.6.1] further assumes that the dynamic multiplier relating money
demand to income is proportional to that relating money demand to the interest
rate:

$$\frac{\partial(M_{t+s} - P_{t+s})}{\partial Y_t} = \beta_1\beta_3^s$$

$$\frac{\partial(M_{t+s} - P_{t+s})}{\partial I_t} = \beta_2\beta_3^s.$$

Again, it seems a good idea to test this assumption before imposing it, by comparing
the fit of [11.6.1] with that of a more general dynamic model. Finally, inflation
may have effects on money demand that are not captured by nominal interest rates.
The specification in [11.6.1] incorporates very strong assumptions about the way
nominal money demand responds to the price level.

To summarize, a specification such as [11.6.1] and [11.6.2] implicitly imposes
many restrictions on dynamics for which there is little or no justification on the
basis of economic theory. Before relying on the inferences of [11.6.1] and [11.6.2],
it seems a good idea to test that model against a more general specification such
as

$$\begin{aligned}
M_t ={}& k_1 + \beta_{12}^{(0)}P_t + \beta_{13}^{(0)}Y_t + \beta_{14}^{(0)}I_t \\
&+ \beta_{11}^{(1)}M_{t-1} + \beta_{12}^{(1)}P_{t-1} + \beta_{13}^{(1)}Y_{t-1} + \beta_{14}^{(1)}I_{t-1} \\
&+ \beta_{11}^{(2)}M_{t-2} + \beta_{12}^{(2)}P_{t-2} + \beta_{13}^{(2)}Y_{t-2} + \beta_{14}^{(2)}I_{t-2} + \cdots \\
&+ \beta_{11}^{(p)}M_{t-p} + \beta_{12}^{(p)}P_{t-p} + \beta_{13}^{(p)}Y_{t-p} + \beta_{14}^{(p)}I_{t-p} + u_t^D.
\end{aligned} \qquad [11.6.5]$$

Like equation [11.6.1], the specification in [11.6.5] is regarded as a structural money
demand equation; $\beta_{13}^{(0)}$ and $\beta_{14}^{(0)}$ are interpreted as the effects of current income
and the interest rate on desired money holdings, and $u_t^D$ represents factors
influencing money demand other than inflation, income, and interest rates. Compared
with [11.6.1], the specification in [11.6.5] generalizes the dynamic behavior for the
error term $v_t^D$, the partial adjustment process, and the influence of the price level
on desired money holdings.


Although [11.6.5] relaxes many of the dubious restrictions on the dynamics
implied by [11.6.1], it is still not possible to estimate [11.6.5] by OLS, because of
simultaneous equations bias. OLS estimation of [11.6.5] will summarize the
correlation between money, the price level, income, and the interest rate. The public's
money demand adjustments are one reason these variables will be correlated, but
not the only one. For example, each period, the central bank may be adjusting
the interest rate $I_t$ to a level consistent with its policy objectives, which may depend
on current and lagged values of income, the interest rate, the price level, and the
money supply:

$$\begin{aligned}
I_t ={}& k_4 + \beta_{41}^{(0)}M_t + \beta_{42}^{(0)}P_t + \beta_{43}^{(0)}Y_t \\
&+ \beta_{41}^{(1)}M_{t-1} + \beta_{42}^{(1)}P_{t-1} + \beta_{43}^{(1)}Y_{t-1} + \beta_{44}^{(1)}I_{t-1} \\
&+ \beta_{41}^{(2)}M_{t-2} + \beta_{42}^{(2)}P_{t-2} + \beta_{43}^{(2)}Y_{t-2} + \beta_{44}^{(2)}I_{t-2} + \cdots \\
&+ \beta_{41}^{(p)}M_{t-p} + \beta_{42}^{(p)}P_{t-p} + \beta_{43}^{(p)}Y_{t-p} + \beta_{44}^{(p)}I_{t-p} + u_t^C.
\end{aligned} \qquad [11.6.6]$$

Here, for example, $\beta_{42}^{(0)}$ captures the effect of the current price level on the interest
rate that the central bank tries to achieve. The disturbance $u_t^C$ captures changes
in policy that cannot be described as a deterministic function of current and lagged
money, the price level, income, and the interest rate. If the money demand
disturbance $u_t^D$ is unusually large, this will make $M_t$ unusually large. If $\beta_{41}^{(0)} > 0$, this
would cause $I_t$ to be unusually large as well, in which case $u_t^D$ would be positively
correlated with the explanatory variable $I_t$ in equation [11.6.5]. Thus, [11.6.5]
cannot be estimated by OLS.

Nor is central bank policy and endogeneity of $I_t$ the only reason to be
concerned about simultaneous equations bias. Money demand disturbances and changes
in central bank policy also have effects on aggregate output and the price level, so
that $Y_t$ and $P_t$ in [11.6.5] are endogenous as well. An aggregate demand equation,
for example, might be postulated that relates the level of output to the money
supply, price level, and interest rate:

$$\begin{aligned}
Y_t ={}& k_3 + \beta_{31}^{(0)}M_t + \beta_{32}^{(0)}P_t + \beta_{34}^{(0)}I_t \\
&+ \beta_{31}^{(1)}M_{t-1} + \beta_{32}^{(1)}P_{t-1} + \beta_{33}^{(1)}Y_{t-1} + \beta_{34}^{(1)}I_{t-1} \\
&+ \beta_{31}^{(2)}M_{t-2} + \beta_{32}^{(2)}P_{t-2} + \beta_{33}^{(2)}Y_{t-2} + \beta_{34}^{(2)}I_{t-2} + \cdots \\
&+ \beta_{31}^{(p)}M_{t-p} + \beta_{32}^{(p)}P_{t-p} + \beta_{33}^{(p)}Y_{t-p} + \beta_{34}^{(p)}I_{t-p} + u_t^A,
\end{aligned} \qquad [11.6.7]$$

with $u_t^A$ representing other factors influencing aggregate demand. Similarly, an
aggregate supply curve might relate the aggregate price level to the other variables
being studied. The logical conclusion of such reasoning is that all of the date $t$
explanatory variables in [11.6.5] should be treated as endogenous.


Relation Between Dynamic Structural Models 
and Vector Autoregressions 


The system of equations [11.6.5] through [11.6.7] (along with an analogous
aggregate supply equation describing $P_t$) can be collected and written in vector
form as

$$B_0y_t = k + B_1y_{t-1} + B_2y_{t-2} + \cdots + B_py_{t-p} + u_t, \qquad [11.6.8]$$

where

$$y_t = (M_t, P_t, Y_t, I_t)'$$

$$u_t = (u_t^D, u_t^S, u_t^A, u_t^C)'$$

$$B_0 = \begin{bmatrix}
1 & -\beta_{12}^{(0)} & -\beta_{13}^{(0)} & -\beta_{14}^{(0)} \\
-\beta_{21}^{(0)} & 1 & -\beta_{23}^{(0)} & -\beta_{24}^{(0)} \\
-\beta_{31}^{(0)} & -\beta_{32}^{(0)} & 1 & -\beta_{34}^{(0)} \\
-\beta_{41}^{(0)} & -\beta_{42}^{(0)} & -\beta_{43}^{(0)} & 1
\end{bmatrix}$$

$$k = (k_1, k_2, k_3, k_4)'$$

and $B_s$ is a $(4 \times 4)$ matrix whose row $i$, column $j$ element is given by $\beta_{ij}^{(s)}$ for $s =
1, 2, \ldots, p$. A large class of structural models for an $(n \times 1)$ vector $y_t$ can be
written in the form of [11.6.8].

Generalizing the argument in [11.6.3], it is assumed that a sufficient number
of lags $p$ are included and the matrices $B_s$ are defined so that $u_t$ is vector white
noise. If instead, say, $u_t$ followed an $r$th-order VAR, with

$$u_t = F_1u_{t-1} + F_2u_{t-2} + \cdots + F_ru_{t-r} + e_t,$$

then we could premultiply [11.6.8] by $(I_n - F_1L^1 - F_2L^2 - \cdots - F_rL^r)$ to arrive
at a system of the same basic form as [11.6.8] with $p$ replaced by $(p + r)$ and with
$u_t$ replaced by the white noise disturbance $e_t$.

If each side of [11.6.8] is premultiplied by $B_0^{-1}$, the result is

$$y_t = c + \Phi_1y_{t-1} + \Phi_2y_{t-2} + \cdots + \Phi_py_{t-p} + \varepsilon_t, \qquad [11.6.9]$$

where

$$c = B_0^{-1}k \qquad [11.6.10]$$

$$\Phi_s = B_0^{-1}B_s \quad \text{for } s = 1, 2, \ldots, p \qquad [11.6.11]$$

$$\varepsilon_t = B_0^{-1}u_t. \qquad [11.6.12]$$

Assuming that [11.6.8] is parameterized sufficiently richly that $u_t$ is vector white
noise, then $\varepsilon_t$ will also be vector white noise and [11.6.9] will be recognized as the
vector autoregressive representation for the dynamic structural system [11.6.8].
Thus, a VAR can be viewed as the reduced form of a general dynamic structural
model.
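
The passage from the structural form [11.6.8] to the reduced form [11.6.9] is just a premultiplication by $B_0^{-1}$, as the following sketch shows for a hypothetical two-variable system with one lag. The matrices `B0`, `B1`, the vector `k`, and the helper names are invented for the illustration.

```python
def inv2(M):
    """Inverse of a (2 x 2) matrix."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det, M[0][0] / det]]


def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def matvec(M, v):
    return [sum(M[i][k] * v[k] for k in range(len(v))) for i in range(len(M))]


def reduced_form(B0, k, B_lags):
    """Apply [11.6.10] and [11.6.11]: c = B0^{-1} k, Phi_s = B0^{-1} B_s."""
    B0_inv = inv2(B0)
    c = matvec(B0_inv, k)
    phis = [matmul(B0_inv, B) for B in B_lags]
    return c, phis, B0_inv


# Hypothetical structural parameters (illustrative values only).
B0 = [[1.0, -0.5],
      [-0.2, 1.0]]
k = [1.0, 2.0]
B1 = [[0.4, 0.1],
      [0.0, 0.3]]

c, phis, B0_inv = reduced_form(B0, k, [B1])

# A structural disturbance u_t maps into the VAR innovation via [11.6.12].
u = [1.0, 0.0]
eps = matvec(B0_inv, u)

# Premultiplying the reduced form by B0 should recover the structural form.
k_back = matvec(B0, c)
B1_back = matmul(B0, phis[0])
print(c, phis[0], eps)
```

Even though only the first structural disturbance is nonzero in this example, both elements of `eps` move, which is the point developed in the discussion of impulse-response functions that follows.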


Interpreting Impulse-Response Functions 
In Section 11.4 we calculated the impulse-response function

$$\frac{\partial y_{t+s}}{\partial \varepsilon_{jt}}. \qquad [11.6.13]$$

This magnitude describes the effect of an innovation in the $j$th variable on future
values of each of the variables in the system. According to [11.6.12], the VAR
innovation $\varepsilon_{jt}$ is a linear combination of the structural disturbances $u_t$. For example,
it might turn out that

$$\varepsilon_{1t} = 0.3u_t^D - 0.6u_t^S + 0.1u_t^A - 0.5u_t^C.$$

In this case, if the cash held by the public is larger than would have been forecast
using the VAR ($\varepsilon_{1t}$ is positive), this might be because the public's demand for cash
is higher than is normally associated with the current level of income and interest
rate (that is, $u_t^D$ is positive). Alternatively, $\varepsilon_{1t}$ might be positive because the central
bank has chosen to ease credit ($u_t^C$ is negative), or a variety of other factors. In
general, $\varepsilon_{jt}$ represents a combination of all the different influences that matter for
any variables in the economy. Viewed this way, it is not clear why the magnitude
[11.6.13] is of particular interest.
By contrast, if we were able to calculate

$$\frac{\partial y_{t+s}}{\partial u_t^C}, \qquad [11.6.14]$$

this would be of considerable interest. Expression [11.6.14] identifies the dynamic
consequences for the economy if the central bank were to tighten credit more than
usual and is a key magnitude for describing the effects of monetary policy on the
economy.

Section 11.4 also discussed calculation of an orthogonalized impulse-response
function. For $\Omega = E(\varepsilon_t\varepsilon_t')$, we found a lower triangular matrix $A$ and a diagonal
matrix $D$ such that $\Omega = ADA'$. We then constructed the vector $A^{-1}\varepsilon_t$ and
calculated the consequences of changes in each element of this vector for future values
of $y$.

Recall from [11.6.12] that the structural disturbances $u_t$ are related to the
VAR innovations $\varepsilon_t$ by

$$u_t = B_0\varepsilon_t. \qquad [11.6.15]$$

Suppose that it happened to be the case that the matrix of structural parameters
$B_0$ was exactly equal to the matrix $A^{-1}$. Then the orthogonalized innovations would
coincide with the true structural disturbances:

$$u_t = B_0\varepsilon_t = A^{-1}\varepsilon_t. \qquad [11.6.16]$$

In this case, the method described in Section 11.4 could be used to find the answers
to important questions such as [11.6.14].

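Is there any reason to hope that $B_0$ and $A^{-1}$ would be the same matrix? Since
$A$ is lower triangular, this clearly requires $B_0$ to be lower triangular. In the example
[11.6.8], this would require that the current values of $P$, $Y$, and $I$ do not influence
money demand, that the current value of $M$ but not that of $Y$ or $I$ enters into the
aggregate supply curve, and so on. Such assumptions are rather unusual, though
there may be another way to order the variables such that a recursive structure is
more palatable. For example, a Keynesian might argue that prices respond to other
economic variables only with a lag, so that the coefficients on current variables in
the aggregate supply equation are all zero. Perhaps money and interest rates
influence aggregate demand only with a lag, so that their current values are excluded
from the aggregate demand equation. One might try to argue further that the
interest rate affects desired money holdings only with a lag as well. Because most
central banks monitor current economic conditions quite carefully, perhaps all the
current values should be included in the equation for $I_t$. These assumptions suggest
ordering the variables as $y_t = (P_t, Y_t, M_t, I_t)'$, for which the structural model would
be

$$\begin{aligned}
\begin{bmatrix} P_t \\ Y_t \\ M_t \\ I_t \end{bmatrix}
={}& \begin{bmatrix} k_1 \\ k_2 \\ k_3 \\ k_4 \end{bmatrix}
+ \begin{bmatrix}
0 & 0 & 0 & 0 \\
\beta_{21}^{(0)} & 0 & 0 & 0 \\
\beta_{31}^{(0)} & \beta_{32}^{(0)} & 0 & 0 \\
\beta_{41}^{(0)} & \beta_{42}^{(0)} & \beta_{43}^{(0)} & 0
\end{bmatrix}
\begin{bmatrix} P_t \\ Y_t \\ M_t \\ I_t \end{bmatrix} \\
&+ \begin{bmatrix}
\beta_{11}^{(1)} & \beta_{12}^{(1)} & \beta_{13}^{(1)} & \beta_{14}^{(1)} \\
\beta_{21}^{(1)} & \beta_{22}^{(1)} & \beta_{23}^{(1)} & \beta_{24}^{(1)} \\
\beta_{31}^{(1)} & \beta_{32}^{(1)} & \beta_{33}^{(1)} & \beta_{34}^{(1)} \\
\beta_{41}^{(1)} & \beta_{42}^{(1)} & \beta_{43}^{(1)} & \beta_{44}^{(1)}
\end{bmatrix}
\begin{bmatrix} P_{t-1} \\ Y_{t-1} \\ M_{t-1} \\ I_{t-1} \end{bmatrix} + \cdots \\
&+ \begin{bmatrix}
\beta_{11}^{(p)} & \beta_{12}^{(p)} & \beta_{13}^{(p)} & \beta_{14}^{(p)} \\
\beta_{21}^{(p)} & \beta_{22}^{(p)} & \beta_{23}^{(p)} & \beta_{24}^{(p)} \\
\beta_{31}^{(p)} & \beta_{32}^{(p)} & \beta_{33}^{(p)} & \beta_{34}^{(p)} \\
\beta_{41}^{(p)} & \beta_{42}^{(p)} & \beta_{43}^{(p)} & \beta_{44}^{(p)}
\end{bmatrix}
\begin{bmatrix} P_{t-p} \\ Y_{t-p} \\ M_{t-p} \\ I_{t-p} \end{bmatrix}
+ \begin{bmatrix} u_t^S \\ u_t^A \\ u_t^D \\ u_t^C \end{bmatrix}.
\end{aligned} \qquad [11.6.17]$$

Suppose there exists such an ordering of the variables for which $B_0$ is lower
triangular. Write the dynamic structural model [11.6.8] as

$$B_0y_t = -\Gamma x_t + u_t, \qquad [11.6.18]$$

where

$$\underset{[n \times (np+1)]}{-\Gamma} \equiv \begin{bmatrix} k & B_1 & B_2 & \cdots & B_p \end{bmatrix}$$

$$\underset{[(np+1) \times 1]}{x_t} \equiv \begin{bmatrix} 1 \\ y_{t-1} \\ y_{t-2} \\ \vdots \\ y_{t-p} \end{bmatrix}.$$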

Suppose, furthermore, that the disturbances in the structural equations are serially
uncorrelated and uncorrelated with each other:

$$E(u_tu_\tau') = \begin{cases} D & \text{for } t = \tau \\ 0 & \text{otherwise,} \end{cases} \qquad [11.6.19]$$

where $D$ is a diagonal matrix. The VAR is the reduced form of the dynamic structural
model [11.6.18] and can be written as

$$y_t = \Pi'x_t + \varepsilon_t, \qquad [11.6.20]$$

where

$$\Pi' = -B_0^{-1}\Gamma \qquad [11.6.21]$$

$$\varepsilon_t = B_0^{-1}u_t. \qquad [11.6.22]$$

Letting $\Omega$ denote the variance-covariance matrix of $\varepsilon_t$, [11.6.22] implies

$$\Omega = E(\varepsilon_t\varepsilon_t') = B_0^{-1}E(u_tu_t')(B_0^{-1})' = B_0^{-1}D(B_0^{-1})'. \qquad [11.6.23]$$

Note that if the only restrictions on the dynamic structural model are that $B_0$
is lower triangular with unit coefficients along the principal diagonal and that $D$ is
diagonal, then the structural model is just identified. To see this, note that these
restrictions imply that $B_0^{-1}$ must also be lower triangular with unit coefficients along
the principal diagonal. Recall from Section 4.4 that given any positive definite
symmetric matrix $\Omega$, there exist a unique lower triangular matrix $A$ with 1s along
the principal diagonal and a diagonal matrix $D$ with positive entries along the
principal diagonal such that $\Omega = ADA'$. Thus, unique values $B_0^{-1}$ and $D$ of the
required form can always be found that satisfy [11.6.23]. Moreover, any $B_0$ matrix
of this form is nonsingular, so that $\Gamma$ in [11.6.21] can be calculated uniquely from
$B_0$ and $\Pi$ as $\Gamma = -B_0\Pi'$. Thus, given any allowable values for the reduced-form
parameters ($\Pi$ and $\Omega$), there exist unique values for the structural parameters ($B_0$,
$\Gamma$, and $D$) of the specified form, establishing that the structural model is just
identified.

Since the model is just identified, full-information maximum likelihood (FIML)
estimates of ($B_0$, $\Gamma$, and $D$) can be obtained by first maximizing the likelihood
function with respect to the reduced-form parameters ($\Pi$ and $\Omega$) and then using
the unique mapping from reduced-form parameters to find the structural
parameters. The maximum likelihood estimates of $\Pi$ are found from OLS regressions of
the elements of $y_t$ on $x_t$, and the MLE of $\Omega$ is obtained from the variance-covariance
matrix of the residuals from these regressions. The estimates $\hat{B}_0^{-1}$ and $\hat{D}$ are then
found from the triangular factorization of $\hat{\Omega}$. This, however, is precisely the
procedure described in calculating the orthogonalized innovations in Section 11.4. The
estimate $\hat{A}$ described there is thus the same as the FIML estimate of $B_0^{-1}$. The
vector of orthogonalized residuals $\hat{u}_t = \hat{A}^{-1}\hat{\varepsilon}_t$ would correspond to the vector of
structural disturbances, and the orthogonalized impulse-response coefficients would
give the dynamic consequences of the structural events represented by $u_t$, provided
that the structural model is lower triangular as in [11.6.17].
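
The uniqueness argument can be illustrated numerically: starting from a hypothetical recursive structure, form $\Omega = B_0^{-1}D(B_0^{-1})'$ and confirm that the triangular factorization of $\Omega$ returns $A = B_0^{-1}$ and the same $D$. The parameter values and function names below are assumptions made for the sketch, and population rather than estimated moments are used, so no sampling error is involved.

```python
def triangular_factor(omega):
    """omega = A D A' with A unit lower triangular; returns (A, diag of D)."""
    n = len(omega)
    A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    d = [0.0] * n
    for j in range(n):
        d[j] = omega[j][j] - sum(A[j][k] ** 2 * d[k] for k in range(j))
        for i in range(j + 1, n):
            partial = sum(A[i][k] * A[j][k] * d[k] for k in range(j))
            A[i][j] = (omega[i][j] - partial) / d[j]
    return A, d


def unit_lower_inverse(L):
    """Inverse of a unit lower triangular matrix by forward substitution."""
    n = len(L)
    inv = [[0.0] * n for _ in range(n)]
    for j in range(n):          # solve L x = e_j for column j of the inverse
        for i in range(n):
            rhs = 1.0 if i == j else 0.0
            inv[i][j] = rhs - sum(L[i][k] * inv[k][j] for k in range(i))
    return inv


# Hypothetical recursive structure: B0 lower triangular, D diagonal.
B0_true = [[1.0, 0.0, 0.0],
           [-0.5, 1.0, 0.0],
           [0.3, -0.2, 1.0]]
D_true = [1.0, 2.0, 0.5]

# Reduced-form covariance implied by [11.6.23].
B0_inv = unit_lower_inverse(B0_true)
n = 3
omega = [[sum(B0_inv[i][k] * D_true[k] * B0_inv[j][k] for k in range(n))
          for j in range(n)] for i in range(n)]

# Work backward from omega alone.
A, d = triangular_factor(omega)
B0_recovered = unit_lower_inverse(A)
print(B0_recovered)
print(d)
```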


Nonrecursive Structural VARs 


Even if the structural model cannot be written in lower triangular form, it
may be possible to give a structural interpretation to a VAR using a similar idea
to that in equation [11.6.23]. Specifically, a structural model specifies a set of
restrictions on $B_0$ and $D$, and we can try to find values satisfying these restrictions
such that $B_0^{-1}D(B_0^{-1})' = \Omega$. This point was developed by Bernanke (1986),
Blanchard and Watson (1986), and Sims (1986).

For illustration, consider again the model of supply and demand discussed in
equations [9.3.2] and [9.3.3]. In that specification, quantity ($q_t$) and price ($p_t$) were
endogenous variables and weather ($w_t$) was exogenous, and it was assumed that
both disturbances were i.i.d. The structural VAR approach to this model would
allow quite general dynamics by adding $p$ lags of all three variables to equations
[9.3.2] and [9.3.3], as well as adding a third equation to describe the dynamic
behavior of weather. Weather presumably does not depend on the behavior of the
market, so the third equation would for this example just be a univariate
autoregression. The model would then be

$$\begin{aligned}
q_t ={}& \beta p_t + \beta_{11}^{(1)}q_{t-1} + \beta_{12}^{(1)}p_{t-1} + \beta_{13}^{(1)}w_{t-1} \\
&+ \beta_{11}^{(2)}q_{t-2} + \beta_{12}^{(2)}p_{t-2} + \beta_{13}^{(2)}w_{t-2} + \cdots \\
&+ \beta_{11}^{(p)}q_{t-p} + \beta_{12}^{(p)}p_{t-p} + \beta_{13}^{(p)}w_{t-p} + u_t^d
\end{aligned} \qquad [11.6.24]$$

$$\begin{aligned}
q_t ={}& \gamma p_t + hw_t + \beta_{21}^{(1)}q_{t-1} + \beta_{22}^{(1)}p_{t-1} + \beta_{23}^{(1)}w_{t-1} \\
&+ \beta_{21}^{(2)}q_{t-2} + \beta_{22}^{(2)}p_{t-2} + \beta_{23}^{(2)}w_{t-2} + \cdots \\
&+ \beta_{21}^{(p)}q_{t-p} + \beta_{22}^{(p)}p_{t-p} + \beta_{23}^{(p)}w_{t-p} + u_t^s
\end{aligned} \qquad [11.6.25]$$

$$w_t = \beta_{33}^{(1)}w_{t-1} + \beta_{33}^{(2)}w_{t-2} + \cdots + \beta_{33}^{(p)}w_{t-p} + u_t^w. \qquad [11.6.26]$$

We could then take $(u_t^d, u_t^s, u_t^w)'$ to be a white noise vector with diagonal
variance-covariance matrix given by $D$. This is an example of a structural model [11.6.18]
in which

$$B_0 = \begin{bmatrix} 1 & -\beta & 0 \\ 1 & -\gamma & -h \\ 0 & 0 & 1 \end{bmatrix}. \qquad [11.6.27]$$

There is no way to order the variables so as to make the matrix $B_0$ lower
triangular. However, equation [11.6.22] indicates that the structural disturbances
$u_t$ are related to the VAR residuals $\varepsilon_t$ by $\varepsilon_t = B_0^{-1}u_t$. Thus, if $B_0$ is estimated by
maximum likelihood, then the impulse-response functions could be calculated as
in Section 11.4 with $A$ replaced by $B_0^{-1}$, and the results would give the effects of
each of the structural disturbances on subsequent values of variables of the system.
Specifically,

$$\frac{\partial \varepsilon_t}{\partial u_t'} = B_0^{-1},$$

so that the effect on $\varepsilon_t$ of the $j$th structural disturbance $u_{jt}$ is given by $b^j$, the $j$th
column of $B_0^{-1}$. Thus, we would calculate

$$\frac{\partial y_{t+s}}{\partial u_{jt}} = \frac{\partial y_{t+s}}{\partial \varepsilon_t'} \frac{\partial \varepsilon_t}{\partial u_{jt}} = \Psi_sb^j$$

for $\Psi_s$ the $(n \times n)$ matrix of coefficients for the $s$th lag of the MA($\infty$) representation
[11.4.1].
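
The calculation $\Psi_s b^j$ can be sketched for the supply-and-demand structure of [11.6.27]. The values chosen for $\beta$, $\gamma$, and $h$, the matrix `psi_1` standing in for $\Psi_1$, and the helper names are all hypothetical; in practice $B_0$ would come from maximum likelihood estimation and $\Psi_s$ from the estimated VAR.

```python
def inverse(M):
    """Matrix inverse by Gauss-Jordan elimination with row pivoting."""
    n = len(M)
    aug = [[float(x) for x in M[i]] + [1.0 if i == j else 0.0 for j in range(n)]
           for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [x / pivot for x in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def structural_response(psi_s, B0_inv, j):
    """Psi_s b^j, where b^j is column j (counted from zero) of B0^{-1}."""
    b_j = [row[j] for row in B0_inv]
    return [sum(psi_s[i][k] * b_j[k] for k in range(len(b_j)))
            for i in range(len(psi_s))]


# Hypothetical structural parameters: downward-sloping demand (beta < 0),
# upward-sloping supply (gamma > 0), weather shifting supply (h > 0).
beta, gamma, h = -1.0, 1.0, 0.5
B0 = [[1.0, -beta, 0.0],
      [1.0, -gamma, -h],
      [0.0, 0.0, 1.0]]
B0_inv = inverse(B0)

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # Psi_0
psi_1 = [[0.5, 0.0, 0.1],   # hypothetical Psi_1 for variables (q, p, w)
         [0.1, 0.4, 0.0],
         [0.0, 0.0, 0.7]]

impact_weather = structural_response(identity, B0_inv, 2)
lag1_weather = structural_response(psi_1, B0_inv, 2)
print(impact_weather, lag1_weather)
```

With these illustrative values a favorable weather disturbance raises quantity and lowers price on impact, as one would expect of a supply shift.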


FIML Estimation of a Structural VAR 
with Unrestricted Dynamics 


FIML estimation is particularly simple if there are no restrictions on the
coefficients $\Gamma$ on lagged variables in [11.6.18]. For example, this would require
including lagged values of $p_{t-j}$ and $q_{t-j}$ in the weather equation [11.6.26]. Using
[11.6.23], the log likelihood function for the system [11.6.18] can be written as

$$\begin{aligned}
\mathcal{L}(B_0, D, \Pi) ={}& -(Tn/2)\log(2\pi) - (T/2)\log|B_0^{-1}D(B_0^{-1})'| \\
&- (1/2)\sum_{t=1}^{T}[y_t - \Pi'x_t]'[B_0^{-1}D(B_0^{-1})']^{-1}[y_t - \Pi'x_t].
\end{aligned} \qquad [11.6.28]$$


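If there are no restrictions on lagged dynamics, this is maximized with respect to
$\Pi$ by OLS regression of $y_t$ on $x_t$. Substituting this estimate into [11.6.28] as in
[11.1.25] produces

$$\begin{aligned}
\mathcal{L}(B_0, D, \hat{\Pi}) ={}& -(Tn/2)\log(2\pi) - (T/2)\log|B_0^{-1}D(B_0^{-1})'| \\
&- (1/2)\sum_{t=1}^{T}\hat{\varepsilon}_t'[B_0^{-1}D(B_0^{-1})']^{-1}\hat{\varepsilon}_t.
\end{aligned} \qquad [11.6.29]$$

But

$$\begin{aligned}
\sum_{t=1}^{T}\hat{\varepsilon}_t'[B_0^{-1}D(B_0^{-1})']^{-1}\hat{\varepsilon}_t
&= \sum_{t=1}^{T}\text{trace}\{\hat{\varepsilon}_t'[B_0^{-1}D(B_0^{-1})']^{-1}\hat{\varepsilon}_t\} \\
&= \sum_{t=1}^{T}\text{trace}\{[B_0^{-1}D(B_0^{-1})']^{-1}\hat{\varepsilon}_t\hat{\varepsilon}_t'\} \\
&= \text{trace}\{[B_0^{-1}D(B_0^{-1})']^{-1}\,T \cdot \hat{\Omega}\} \\
&= T \times \text{trace}\{[B_0^{-1}D(B_0^{-1})']^{-1}\hat{\Omega}\} \\
&= T \times \text{trace}\{(B_0'D^{-1}B_0)\hat{\Omega}\}.
\end{aligned} \qquad [11.6.30]$$

Furthermore,

$$\log|B_0^{-1}D(B_0^{-1})'| = \log\{|B_0^{-1}| \cdot |D| \cdot |B_0^{-1}|\} = -\log|B_0|^2 + \log|D|. \qquad [11.6.31]$$

Substituting [11.6.31] and [11.6.30] into [11.6.29], FIML estimates of the structural
parameters are found by choosing $B_0$ and $D$ so as to maximize

$$\begin{aligned}
\mathcal{L}(B_0, D, \hat{\Pi}) ={}& -(Tn/2)\log(2\pi) + (T/2)\log|B_0|^2 - (T/2)\log|D| \\
&- (T/2)\,\text{trace}\{(B_0'D^{-1}B_0)\hat{\Omega}\}.
\end{aligned} \qquad [11.6.32]$$

Using calculations similar to those used to analyze [11.1.25], one can show
that if there exist unique matrices $B_0$ and $D$ of the required form satisfying
$B_0^{-1}D(B_0^{-1})' = \hat{\Omega}$, then maximization of [11.6.32] will produce estimates $\hat{B}_0$ and
$\hat{D}$ satisfying

$$\hat{B}_0^{-1}\hat{D}(\hat{B}_0^{-1})' = \hat{\Omega}. \qquad [11.6.33]$$

This is a nonlinear system of equations, and numerical maximization of [11.6.32]
offers a convenient general approach to finding a solution to this system of
equations.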
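
As a rough illustration of [11.6.32] and [11.6.33], the sketch below evaluates the concentrated log likelihood for a hypothetical two-variable recursive model in which $B_0$ has a single free element $b$. The residual covariance matrix `omega_hat` is constructed so that $b = 0.5$ and $D = \text{diag}(1, 2)$ satisfy [11.6.33] exactly, and a coarse grid search stands in for a proper numerical optimizer. The function name, sample size, and numbers are assumptions for the example.

```python
import math


def concentrated_loglik(b, d, omega_hat, T):
    """Evaluate [11.6.32] for n = 2 with B0 = [[1, 0], [-b, 1]], D = diag(d)."""
    n = 2
    B0 = [[1.0, 0.0], [-b, 1.0]]
    det_B0 = B0[0][0] * B0[1][1] - B0[0][1] * B0[1][0]
    # trace{(B0' D^{-1} B0) Omega} = sum_j (row j of B0) Omega (row j of B0)' / d_j
    trace_term = 0.0
    for j in range(n):
        row = B0[j]
        quad = sum(row[r] * omega_hat[r][c] * row[c]
                   for r in range(n) for c in range(n))
        trace_term += quad / d[j]
    return (-(T * n / 2) * math.log(2 * math.pi)
            + (T / 2) * math.log(det_B0 ** 2)
            - (T / 2) * sum(math.log(x) for x in d)
            - (T / 2) * trace_term)


# Hypothetical residual covariance, equal to B0^{-1} D (B0^{-1})' at
# b = 0.5 and D = diag(1, 2).
omega_hat = [[1.0, 0.5],
             [0.5, 2.25]]
T = 100

loglik_at_solution = concentrated_loglik(0.5, [1.0, 2.0], omega_hat, T)
loglik_other_b = concentrated_loglik(0.3, [1.0, 2.0], omega_hat, T)
loglik_other_d = concentrated_loglik(0.5, [1.2, 2.0], omega_hat, T)

# Coarse grid over b, holding D at diag(1, 2).
grid = [i / 10 for i in range(11)]
best_b = max(grid, key=lambda b: concentrated_loglik(b, [1.0, 2.0], omega_hat, T))
print(loglik_at_solution, loglik_other_b, loglik_other_d, best_b)
```

At values satisfying [11.6.33] the trace term equals $n$, so the maximized value depends on the data only through $\log|\hat{\Omega}|$.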


Identification of Structural VARs 


The existence of a unique maximum of [11.6.32] requires both an order
condition and a rank condition for identification. The order condition is that $B_0$
and $D$ have no more unknown parameters than $\Omega$. Since $\Omega$ is symmetric, it can
be summarized by $n(n + 1)/2$ distinct values. If $D$ is diagonal, it requires $n$
parameters, meaning that $B_0$ can have no more than $n(n - 1)/2$ free parameters. For the
supply-and-demand example of [11.6.24] through [11.6.26], $n = 3$, and the matrix
$B_0$ in [11.6.27] has $3(3 - 1)/2 = 3$ free parameters ($\beta$, $\gamma$, and $h$). Thus, that example
satisfies the order condition for identification.

Even if the order condition is satisfied, the model may still not be identified.
For example, suppose that

$$B_0 = \begin{bmatrix} 1 & -\beta & 0 \\ 1 & -\gamma & 0 \\ 0 & 0 & 1 \end{bmatrix}.$$

Even though this specification satisfies the order condition, it fails the rank
condition, since the value of the likelihood function will be unchanged if $\beta$ and $\gamma$ are
switched along with $\sigma_d^2$ and $\sigma_s^2$.

To characterize the rank condition, suppose that there are n, elements of B, 
that must be estimated, collect these in an (mg X 1) vector @,. The identifying 
assumptions can be represented as a known (n? X ng) matrix Sg and a known 
(n? x 1) vector s, for which 


vec(B,) = 8,0, + Sa. {11.6.34] 
For example, for the dynamic model of supply and demand represented by [11.6.27], 


-B B 
vec(Bo) a 05 Y 
h 


_ 


Sg 


wn 
a 
I 
Coo oo!looonw 
_ 


coool! aooce 

colonoaoooo 
i] 

Re oococ”corre 


Similarly, collect the unknown elements of D in an (mp X 1) vector @p, with 
vec(D) = Sp@p + Sp [11.6.35] 


for Sp an (n? X Hp) matrix and Sp an (n? X 1) vector. For the supply-and-demand 
example, 


vec(D) 


Ml 
ocoofXocoo 

CS 
S 
| 
q 
N 


eaoocorcoaocs 
rFooocncdcos 
~”A 
is} 
u 
eoeooocc0cd 


11.6. Vector Autoregressions and Structural Econometric Models 333 


Since [11.6,33] is an equation relating two symmetric matrices, there are 
n* = n(n + 1)/2 separate conditions, represented by 


vech(Q) = vech((8,(0,)1- "(DC o)KB(00)}"'). [11.6.36] 
Denote the right side of [11.6.36] by f(@,, @5), where f: (R” x R"”) > R™: 
vech(Q) = f(@z, Op). {11.6.37] 


Appendix 11.B shows that the [1* x (sg + #p)] matrix of derivatives of this 
function is given by 


{2 2 vech(Q) a ret 
30, 405 [11.6.38] 
= [t-20;10 @ Br')S,) — DF[(By') @ (Bi 180]. 


where D,* is the (n* x n?) matrix defined in [11.1.45]. 

Suppose that the columns of the matrix in [11.6.38] were linearly dependent;
that is, suppose there exists a nonzero [($n_B + n_D$) $\times$ 1] vector $\lambda$ such that $J\lambda = 0$.
This would mean that if a small multiple of $\lambda$ were added to $(\theta_B', \theta_D')'$, the model
would imply the same probability distribution for the data. We would have no basis
for distinguishing between these alternative values for $(\theta_B, \theta_D)$, meaning that the
model would be unidentified.

Thus, the rank condition for identification of a structural VAR requires that
the ($n_B + n_D$) columns of the matrix $J$ in [11.6.38] be linearly independent.$^{15}$ The
order condition is that the number of rows of $J$ ($n^* = n(n + 1)/2$) be at least as
great as the number of columns.

To check this condition in practice, the simplest approach is usually to make 
a guess as to the values of the structural parameters and check J numerically. 
Giannini (1992) derived an alternative expression for the rank condition and
provided computer software for checking it numerically.
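The numerical check described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the parameter values are arbitrary guesses, the helper names (`duplication_matrix`, `selection`, `jacobian`) are hypothetical, and the second model is the variant with $h$ restricted to zero that was said to fail the rank condition.

```python
import numpy as np

def duplication_matrix(n):
    """Return D_n, the (n^2 x n*) matrix with D_n @ vech(A) = vec(A) for symmetric A."""
    n_star = n * (n + 1) // 2
    D = np.zeros((n * n, n_star))
    col = 0
    for j in range(n):                # vech stacks the lower triangle column by column
        for i in range(j, n):
            D[j * n + i, col] = 1.0   # position of element (i, j) in column-major vec
            D[i * n + j, col] = 1.0   # position of element (j, i)
            col += 1
    return D

def selection(n, positions):
    """Build an S matrix: parameter k enters vec position (row, col) with the given sign."""
    S = np.zeros((n * n, len(positions)))
    for k, (row, col, sign) in enumerate(positions):
        S[col * n + row, k] = sign
    return S

def jacobian(B0, D, S_B, S_D):
    """Evaluate J in [11.6.38] at guessed structural parameters."""
    n = B0.shape[0]
    Dn = duplication_matrix(n)
    D_plus = np.linalg.solve(Dn.T @ Dn, Dn.T)        # D_n^+ = (D_n' D_n)^{-1} D_n'
    B_inv = np.linalg.inv(B0)
    Omega = B_inv @ D @ B_inv.T
    J_B = -2.0 * D_plus @ np.kron(Omega, B_inv) @ S_B
    J_D = D_plus @ np.kron(B_inv, B_inv) @ S_D
    return np.hstack([J_B, J_D])

# Guessed values, chosen only for illustration.
beta, gamma, h = -1.0, 1.0, 0.5
D = np.diag([1.0, 2.0, 3.0])
S_D = selection(3, [(0, 0, 1.0), (1, 1, 1.0), (2, 2, 1.0)])

# Model [11.6.27]: free parameters beta, gamma, h.
B0_a = np.array([[1.0, -beta, 0.0], [1.0, -gamma, -h], [0.0, 0.0, 1.0]])
S_B_a = selection(3, [(0, 1, -1.0), (1, 1, -1.0), (1, 2, -1.0)])
J_a = jacobian(B0_a, D, S_B_a, S_D)

# Variant with h restricted to zero: free parameters beta and gamma only.
B0_b = np.array([[1.0, -beta, 0.0], [1.0, -gamma, 0.0], [0.0, 0.0, 1.0]])
S_B_b = selection(3, [(0, 1, -1.0), (1, 1, -1.0)])
J_b = jacobian(B0_b, D, S_B_b, S_D)

rank_a = np.linalg.matrix_rank(J_a)
rank_b = np.linalg.matrix_rank(J_b)
print(J_a.shape, rank_a)   # full column rank is what the rank condition asks for
print(J_b.shape, rank_b)   # fewer independent columns than parameters
```

In the second model the rows of $J$ for $\Omega_{31}$ and $\Omega_{32}$ are zero, so five parameters cannot be recovered from the remaining four rows even though the order condition holds.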


Structural VAR with Restrictions on $\Pi$


The supply-and-demand example of [11.6.24] to [11.6.26] did not satisfy the
assumptions behind the derivation of [11.6.32], because [11.6.26] imposed the
restriction that lagged values of $p$ and $q$ did not belong in the weather equation.
Where such restrictions are imposed, it is no longer the case that the FIML
estimates of $\Pi$ are obtained by OLS, and system parameters would have to be
estimated as described in Section 11.3. As an alternative, OLS estimation of [11.6.24]
through [11.6.26] would still give consistent estimates of $\Pi$, and the variance-covariance
matrix of the residuals from these regressions would provide a consistent
estimate $\hat{\Omega}$. One could still use this estimate in [11.6.32], and the resulting
maximization problem would give reasonable estimates of $B_0$ and $D$.


Structural VARs and Forward-Looking Behavior 


The supply-and-demand example assumed that lagged values of price and 
quantity did not appear in the equation for weather. The spirit of VARs is that 


$^{15}$This condition characterizes local identification; it may be that even if a model satisfies both the
rank and the order condition, there are two noncontiguous values of $(\theta_B', \theta_D')$ for which the likelihood
has the same value for all realizations of the data. See Rothenberg (1971, Theorem 6, p. 585).




such assumptions ought to be tested before being imposed. What should we con- 
clude if, contrary to our prior expectations, the price of oranges turned out to 
Granger-cause the weather in Florida? It certainly cannot be that the price is a 
cause of the weather. Instead, such a finding would suggest forward-looking be- 
havior on the part of buyers or sellers of oranges; for example, it may be that if 
buyers anticipate bad weather in the future, they bid up the price of oranges today. 
If this should prove to be the case, the identifying assumption in [11.6.24] that 
demand depends on the weather only through its effect on the current price needs 
to be reexamined. Proper modeling of forward-looking behavior can provide an 
alternative way to identify VARs, as explored by Flavin (1981), Hansen and Sargent 
(1981), and Keating (1990), among others. 


Other Approaches to Identifying Structural VARs 


Identification was discussed in previous subsections primarily in terms of
exclusion restrictions on the matrix of structural coefficients $B_0$. Blanchard and
Diamond (1989, 1990) used a priori assumptions about the signs of structural
parameters to identify a range of values of $B_0$ consistent with the data. Shapiro
and Watson (1988) and Blanchard and Quah (1989) used assumptions about long-run
multipliers to achieve identification.


A Critique of Structural VARs 


Structural VARs have appeal for two different kinds of inquiry. The first 
potential user is someone who is primarily interested in estimating a structural 
equation such as the money demand function in [11.6.1]. If a model imposes 
restrictions on the dynamics of the relationship, it seems good practice to test these 
restrictions against a more general specification such as [11.6.5] before relying on 
the restricted model for inference. Furthermore, in order to estimate the dynamic 
consequences of, say, income on money demand, we have to take into account the 
fact that, historically, when income goes up, this has typically been associated with 
future changes in income and interest rates. What time path for these explanatory 
variables should be assumed in order to assess the consequences for money demand 
at time $t + s$ of a change in income at time $t$? A VAR offers a framework for
posing this question: we use the time path that would historically be predicted
for those variables following an unanticipated change in income. 

A second potential user is someone who is interested in summarizing the
dynamics of a vector $y_t$ while imposing as few restrictions as possible. Insofar as
this summary includes calculation of impulse-response functions, we need some
motivation for what the statistics mean. Suppose we find that there is a temporary
rise in income following an innovation in money. One is tempted to interpret this
finding as suggesting that expansionary monetary policy has a positive but temporary
effect on output. However, such an interpretation implicitly assumes that
the orthogonalized "money innovation" is the same as the disturbance term in a
description of central bank policy. Insofar as impulse-response functions are used
to make statements that are structural in nature, it seems reasonable to try to use 
an orthogonalization that represents our understanding of these relationships as 
well as possible. This point has been forcefully argued by Cooley and LeRoy (1985), 
Leamer (1985), Bernanke (1986), and Blanchard (1989), among others. 

Even so, it must be recognized that convincing identifying assumptions are 
hard to come by. For example, the ordering in [11.6.17] is clearly somewhat ar- 
bitrary, and the exclusion restrictions are difficult to defend. Indeed, if there were 
compelling identifying assumptions for such a system, the fierce debates among 




macroeconomists would have been settled long ago! Simultaneous equations bias 
is very pervasive in the social sciences, and drawing structural inferences from 
observed correlations must always proceed with great care. We surely cannot always 
expect to find credible identifying assumptions to enable us to identify the causal 
relations among any arbitrary set of n variables on which we have data. 


11.7. Standard Errors for Impulse-Response Functions 


Standard Errors for Nonorthogonalized Impulse-Response 
Function Based on Analytical Derivatives 


Section 11.4 discussed how $\Psi_s$, the matrix of impulse-response coefficients
at lag $s$, would be constructed from knowledge of the autoregressive coefficients.
In practice, the autoregressive coefficients are not known with certainty but must
be estimated by OLS regressions. When the estimated values of the autoregressive
coefficients are used to calculate $\Psi_s$, it is useful to report the implied standard
errors for the estimates $\hat{\Psi}_s$.$^{16}$

Adopting the notation from Proposition 11.1, let $k = np + 1$ denote the
number of coefficients in each equation of the VAR and let $\pi \equiv \operatorname{vec}(\Pi)$ denote
the ($nk \times 1$) vector of parameters for all the equations; the first $k$ elements of $\pi$
give the constant term and autoregressive coefficients for the first equation, the
next $k$ elements of $\pi$ give the parameters for the second equation, and so on. Let
$\psi_s \equiv \operatorname{vec}(\Psi_s')$ denote the ($n^2 \times 1$) vector of moving average coefficients associated
with lag $s$. The first $n$ elements of $\psi_s$ are given by the first row of $\Psi_s$ and identify
the response of $y_{1,t+s}$ to $\varepsilon_t$. The next $n$ elements of $\psi_s$ are given by the second row
of $\Psi_s$ and identify the response of $y_{2,t+s}$ to $\varepsilon_t$, and so on. Given the values of the
autoregressive coefficients in $\pi$, the VAR can be simulated to calculate $\psi_s$. Thus,
$\psi_s$ could be regarded as a nonlinear function of $\pi$, represented by the function
$\psi_s(\pi)$, $\psi_s: \mathbb{R}^{nk} \to \mathbb{R}^{n^2}$.

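The impulse-response coefficients are estimated by replacing $\pi$ with the OLS
estimates $\hat{\pi}_T$, generating the estimate $\hat{\psi}_{s,T} = \psi_s(\hat{\pi}_T)$. Recall that under the
conditions of Proposition 11.1, $\sqrt{T}(\hat{\pi}_T - \pi) \xrightarrow{L} X$, where

$$X \sim N\left(0, (\Omega \otimes Q^{-1})\right). \qquad [11.7.1]$$

Standard errors for $\hat{\psi}_s$ can then be calculated by applying Proposition 7.4:

$$\sqrt{T}(\hat{\psi}_{s,T} - \psi_s) \xrightarrow{L} G_s X,$$

where

$$\underset{(n^2 \times nk)}{G_s} \equiv \frac{\partial \psi_s(\pi)}{\partial \pi'}. \qquad [11.7.2]$$

That is,

$$\sqrt{T}(\hat{\psi}_{s,T} - \psi_s) \xrightarrow{L} N\left(0, G_s(\Omega \otimes Q^{-1})G_s'\right). \qquad [11.7.3]$$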


Standard errors for an estimated impulse-response coefficient are given by the 


$^{16}$Calculations related to those developed in this section appeared in Baillie (1987), Lütkepohl (1989,
1990), and Giannini (1992). Giannini provided computer software for calculating some of these
magnitudes.




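square root of the associated diagonal element of $(1/T)\hat{G}_{s,T}(\hat{\Omega}_T \otimes \hat{Q}_T^{-1})\hat{G}_{s,T}'$, where

$$\hat{G}_{s,T} = \left.\frac{\partial \psi_s(\pi)}{\partial \pi'}\right|_{\pi = \hat{\pi}_T}$$

$$\hat{Q}_T = (1/T)\sum_{t=1}^{T} x_t x_t',$$

with $x_t$ and $\hat{\Omega}_T$ as defined in Proposition 11.1.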
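To apply this result, we need an expression for the matrix $G_s$ in [11.7.2].
Appendix 11.B to this chapter establishes that the sequence $\{G_s\}_{s=1}^{m}$ can be
calculated by iterating on

$$G_s = [I_n \otimes (0_{n1} \;\; \Psi_{s-1}' \;\; \Psi_{s-2}' \;\; \cdots \;\; \Psi_{s-p}')] + (\Phi_1 \otimes I_n)G_{s-1} + (\Phi_2 \otimes I_n)G_{s-2} + \cdots + (\Phi_p \otimes I_n)G_{s-p}. \qquad [11.7.4]$$

Here $0_{n1}$ denotes an ($n \times 1$) vector of zeros. The iteration is initialized by setting
$G_0 = G_{-1} = \cdots = G_{-p+1} = 0_{n^2, nk}$. It is also understood that $\Psi_0 = I_n$ and
$\Psi_s = 0_{nn}$ for $s < 0$. Thus, for example,

$$G_1 = [I_n \otimes (0_{n1} \;\; I_n \;\; 0_{nn} \;\; \cdots \;\; 0_{nn})]$$

$$G_2 = [I_n \otimes (0_{n1} \;\; \Psi_1' \;\; I_n \;\; \cdots \;\; 0_{nn})] + (\Phi_1 \otimes I_n)G_1.$$

A closed-form solution for [11.7.4] is given by

$$G_s = \sum_{i=1}^{s} [\Psi_{i-1} \otimes (0_{n1} \;\; \Psi_{s-i}' \;\; \Psi_{s-i-1}' \;\; \cdots \;\; \Psi_{s-i-p+1}')]. \qquad [11.7.5]$$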
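As a rough illustration of how the iteration [11.7.4] and the closed form [11.7.5] might be coded, the NumPy sketch below computes $G_s$ both ways for a small VAR(2) and compares them. The coefficient values and function names are made up for the example; they are not taken from the text.

```python
import numpy as np

def psi_matrices(Phi, m):
    """Psi_0, ..., Psi_m from Psi_s = Phi_1 Psi_{s-1} + ... + Phi_p Psi_{s-p}."""
    n, p = Phi[0].shape[0], len(Phi)
    Psi = [np.eye(n)]
    for s in range(1, m + 1):
        Psi.append(sum(Phi[k - 1] @ Psi[s - k] for k in range(1, min(s, p) + 1)))
    return Psi

def lagged_block(Psi, start, p):
    """The (n x k) matrix (0_{n1}, Psi'_{start}, Psi'_{start-1}, ..., Psi'_{start-p+1})."""
    n = Psi[0].shape[0]
    cols = [np.zeros((n, 1))]
    for j in range(p):
        idx = start - j
        cols.append(Psi[idx].T if idx >= 0 else np.zeros((n, n)))   # Psi_s = 0 for s < 0
    return np.hstack(cols)

def G_recursive(Phi, m):
    """G_1, ..., G_m by iterating on [11.7.4]; returns a dict keyed by s."""
    n, p = Phi[0].shape[0], len(Phi)
    k = n * p + 1
    Psi = psi_matrices(Phi, m)
    G = {s: np.zeros((n * n, n * k)) for s in range(-p + 1, 1)}     # G_0 = ... = G_{-p+1} = 0
    for s in range(1, m + 1):
        G[s] = np.kron(np.eye(n), lagged_block(Psi, s - 1, p))
        for j in range(1, p + 1):
            G[s] = G[s] + np.kron(Phi[j - 1], np.eye(n)) @ G[s - j]
    return G

def G_closed_form(Phi, s):
    """G_s from the closed-form sum [11.7.5]."""
    p = len(Phi)
    Psi = psi_matrices(Phi, s)
    return sum(np.kron(Psi[i - 1], lagged_block(Psi, s - i, p)) for i in range(1, s + 1))

# Illustrative VAR(2) coefficients (hypothetical values).
Phi = [np.array([[0.5, 0.1], [0.2, 0.3]]),
       np.array([[0.1, 0.0], [0.0, 0.1]])]
G = G_recursive(Phi, 4)
gaps = [np.abs(G[s] - G_closed_form(Phi, s)).max() for s in range(1, 5)]
print(G[1].shape, max(gaps))
```

Given estimates of $\Omega$ and $Q$, standard errors would then follow from the diagonal of $(1/T)G_s(\Omega \otimes Q^{-1})G_s'$ as in [11.7.3].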


Alternative Approaches to Calculating Standard Errors 
for Nonorthogonalized Impulse-Response Function 


The matrix of derivatives $G_s$ can alternatively be calculated numerically as
follows. First we use the OLS estimates $\hat{\pi}$ to calculate $\psi_s(\hat{\pi})$ for $s = 1, 2, \ldots, m$.
We then increase the value of the $i$th element of $\pi$ by some small amount $\Delta$,
holding all other elements constant, and evaluate $\psi_s(\hat{\pi} + e_i\Delta)$ for $s = 1, 2, \ldots, m$,
where $e_i$ denotes the $i$th column of $I_{nk}$. Then the ($n^2 \times 1$) vector

$$\frac{\psi_s(\hat{\pi} + e_i\Delta) - \psi_s(\hat{\pi})}{\Delta}$$

gives an estimate of the $i$th column of $G_s$. By conducting separate evaluations of
the sequence $\psi_s(\hat{\pi} + e_i\Delta)$ for each $i = 1, 2, \ldots, nk$, all of the columns of $G_s$
can be filled in.
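The finite-difference calculation just described might look as follows in NumPy. This is a sketch under assumptions: the function names, the layout of `pi_hat` (constant first, then the lag coefficients of each equation in turn), the step size and the numerical values are all illustrative choices.

```python
import numpy as np

def psi_vec(pi, n, p, s):
    """psi_s(pi) = vec(Psi_s') for a VAR(p) whose coefficients are stacked in pi."""
    k = n * p + 1
    Pi_T = pi.reshape(n, k)                      # row i: constant, then coefficients of equation i
    Phi = [Pi_T[:, 1 + j * n: 1 + (j + 1) * n] for j in range(p)]
    Psi = [np.eye(n)]
    for r in range(1, s + 1):
        Psi.append(sum(Phi[j - 1] @ Psi[r - j] for j in range(1, min(r, p) + 1)))
    return Psi[s].reshape(-1)                    # rows of Psi_s stacked, i.e. vec(Psi_s')

def numerical_G(pi_hat, n, p, s, delta=1e-6):
    """Forward-difference estimate of G_s, filled in one column at a time."""
    base = psi_vec(pi_hat, n, p, s)
    G = np.zeros((n * n, pi_hat.size))
    for i in range(pi_hat.size):
        bumped = pi_hat.copy()
        bumped[i] += delta                       # pi_hat + e_i * delta
        G[:, i] = (psi_vec(bumped, n, p, s) - base) / delta
    return G

# Hypothetical VAR(1) estimates for n = 2: (c_1, phi_11, phi_12, c_2, phi_21, phi_22).
pi_hat = np.array([0.0, 0.5, 0.1, 0.0, 0.2, 0.3])
G1_num = numerical_G(pi_hat, n=2, p=1, s=1)
G2_num = numerical_G(pi_hat, n=2, p=1, s=2)
print(np.round(G1_num, 4))
```

For $s = 1$ the result should reproduce $G_1 = [I_n \otimes (0_{n1} \;\; I_n)]$, since $\psi_1$ is linear in $\pi$; for larger $s$ the forward difference carries an error of order $\Delta$.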

Monte Carlo methods can also be used to infer the distribution of $\psi_s(\hat{\pi}_T)$.
Here we would randomly generate an ($nk \times 1$) vector drawn from a $N(\hat{\pi},
(1/T)(\hat{\Omega} \otimes \hat{Q}^{-1}))$ distribution. Denote this vector by $\pi^{(1)}$, and calculate $\psi_s(\pi^{(1)})$.
Draw a second vector $\pi^{(2)}$ from the same distribution and calculate $\psi_s(\pi^{(2)})$. Repeat
this for, say, 10,000 separate simulations. If 9500 of these simulations result in a
value of the first element of $\psi_s$ that is between $\underline{\psi}_{s1}$ and $\overline{\psi}_{s1}$, then $[\underline{\psi}_{s1}, \overline{\psi}_{s1}]$ can be
used as a 95% confidence interval for the first element of $\psi_s$.
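A minimal sketch of this Monte Carlo calculation is given below. The values of `pi_hat`, `Omega_hat`, `Q_hat` and `T` are invented for illustration, and taking the 2.5th and 97.5th percentiles is one particular way (an assumption here) of choosing a range that contains 95% of the simulated values.

```python
import numpy as np

def psi_vec(pi, n, p, s):
    """psi_s(pi) = vec(Psi_s') for a VAR(p) whose coefficients are stacked in pi."""
    k = n * p + 1
    Pi_T = pi.reshape(n, k)
    Phi = [Pi_T[:, 1 + j * n: 1 + (j + 1) * n] for j in range(p)]
    Psi = [np.eye(n)]
    for r in range(1, s + 1):
        Psi.append(sum(Phi[j - 1] @ Psi[r - j] for j in range(1, min(r, p) + 1)))
    return Psi[s].reshape(-1)

rng = np.random.default_rng(0)
n, p, T, s = 2, 1, 200, 2

# Hypothetical OLS output for a bivariate VAR(1).
pi_hat = np.array([0.0, 0.5, 0.1, 0.0, 0.2, 0.3])
Omega_hat = np.array([[1.0, 0.2], [0.2, 0.5]])
Q_hat = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.3], [0.0, 0.3, 1.5]])

# Draw pi^(1), pi^(2), ... from N(pi_hat, (1/T)(Omega_hat kron Q_hat^{-1})).
cov = np.kron(Omega_hat, np.linalg.inv(Q_hat)) / T
draws = rng.multivariate_normal(pi_hat, cov, size=10_000)

# First element of psi_s for every draw, then a central 95% range.
sims = np.array([psi_vec(d, n, p, s)[0] for d in draws])
lower, upper = np.percentile(sims, [2.5, 97.5])
point = psi_vec(pi_hat, n, p, s)[0]
print(lower, point, upper)
```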

Runkle (1987) employed a related approach based on bootstrapping. The idea
behind bootstrapping is to obtain an estimate of the small-sample distribution of
$\hat{\pi}$ without assuming that the innovations $\varepsilon_t$ are Gaussian. To implement this
procedure, first estimate the VAR and save the coefficient estimates $\hat{\pi}$ and the fitted
residuals $\{\hat{\varepsilon}_1, \hat{\varepsilon}_2, \ldots, \hat{\varepsilon}_T\}$. Then consider an artificial random variable $u$ that has
probability $(1/T)$ of taking on each of the particular values $\{\hat{\varepsilon}_1, \hat{\varepsilon}_2, \ldots, \hat{\varepsilon}_T\}$. The
hope is that the distribution of $u$ is similar to the distribution of the true population
$\varepsilon$'s. Then take a random draw from this distribution (denoted $u_1^{(1)}$), and use this
to construct the first innovation in an artificial sample; that is, set

$$y_1^{(1)} = \hat{c} + \hat{\Phi}_1 y_0 + \hat{\Phi}_2 y_{-1} + \cdots + \hat{\Phi}_p y_{-p+1} + u_1^{(1)},$$

where $y_0, y_{-1}, \ldots,$ and $y_{-p+1}$ denote the presample values of $y$ that were actually
observed in the historical data. Taking a second draw $u_2^{(1)}$, generate

$$y_2^{(1)} = \hat{c} + \hat{\Phi}_1 y_1^{(1)} + \hat{\Phi}_2 y_0 + \cdots + \hat{\Phi}_p y_{-p+2} + u_2^{(1)}.$$

Note that this second draw is with replacement; that is, there is a $(1/T)$ chance
that $u_1^{(1)}$ is exactly the same as $u_2^{(1)}$. Proceeding in this fashion, a full sample
$\{y_1^{(1)}, y_2^{(1)}, \ldots, y_T^{(1)}\}$ can be generated. A VAR can be fitted by OLS to these
simulated data (again taking presample values of $y$ as their historical values),
producing an estimate $\hat{\pi}^{(1)}$. From this estimate, the magnitude $\psi_s(\hat{\pi}^{(1)})$ can be
calculated. Next, generate a second set of $T$ draws from the distribution of $u$,
denoted $\{u_1^{(2)}, u_2^{(2)}, \ldots, u_T^{(2)}\}$, fit $\hat{\pi}^{(2)}$ to these data by OLS, and calculate
$\psi_s(\hat{\pi}^{(2)})$. A series of 10,000 such simulations could be undertaken, and a 95%
confidence interval for $\psi_{s1}(\hat{\pi})$ is then inferred from the range that includes 95%
of the values for $\psi_{s1}(\hat{\pi}^{(i)})$.
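The bootstrap steps above can be sketched for the simplest case, a VAR(1), as follows. Everything specific here is an assumption made for the illustration: the "historical" data are themselves simulated, the helper names `fit_var1` and `simulate` are hypothetical, only 500 replications are used to keep the run short, and a central percentile range stands in for "the range that includes 95% of the values".

```python
import numpy as np

def fit_var1(y):
    """OLS for y_t = c + Phi y_{t-1} + e_t; row 0 of y is the presample value."""
    X = np.hstack([np.ones((y.shape[0] - 1, 1)), y[:-1]])
    coef, *_ = np.linalg.lstsq(X, y[1:], rcond=None)    # coef has shape (k, n)
    resid = y[1:] - X @ coef
    return coef[0], coef[1:].T, resid                   # c_hat, Phi_hat, fitted residuals

def simulate(c, Phi, y0, shocks):
    """Build y_1, ..., y_T recursively from the presample value y0 and the given shocks."""
    y = [y0]
    for u in shocks:
        y.append(c + Phi @ y[-1] + u)
    return np.array(y)

rng = np.random.default_rng(1)
n, T, s, reps = 2, 150, 2, 500

# Stand-in for the historical data (fabricated for the example).
Phi_true = np.array([[0.5, 0.1], [0.2, 0.3]])
data = simulate(np.zeros(n), Phi_true, np.zeros(n), rng.standard_normal((T, n)))

c_hat, Phi_hat, resid = fit_var1(data)
point = np.linalg.matrix_power(Phi_hat, s)[0, 0]        # for a VAR(1), Psi_s = Phi^s

boot = np.empty(reps)
for b in range(reps):
    shocks = resid[rng.integers(0, T, size=T)]          # T draws with replacement
    y_star = simulate(c_hat, Phi_hat, data[0], shocks)  # presample value kept at its historical value
    _, Phi_star, _ = fit_var1(y_star)
    boot[b] = np.linalg.matrix_power(Phi_star, s)[0, 0]

lower, upper = np.percentile(boot, [2.5, 97.5])
print(lower, point, upper)
```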


Standard Errors for Parameters of a Structural VAR 


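Recall from Proposition 11.2 and equation [11.1.48] that if the innovations
are Gaussian,

$$\sqrt{T}[\operatorname{vech}(\hat{\Omega}_T) - \operatorname{vech}(\Omega)] \xrightarrow{L} N\left(0, 2D_n^+(\Omega \otimes \Omega)(D_n^+)'\right).$$

The estimates of the parameters of a structural VAR ($\hat{B}_0$ and $\hat{D}$) are determined
as implicit functions of $\hat{\Omega}$ from

$$\hat{\Omega} = \hat{B}_0^{-1}\hat{D}(\hat{B}_0^{-1})'. \qquad [11.7.6]$$

As in equation [11.6.34], the unknown elements of $B_0$ are summarized by an
($n_B \times 1$) vector $\theta_B$ with $\operatorname{vec}(B_0) = S_B\theta_B + s_B$. Similarly, as in [11.6.35], it is
assumed that $\operatorname{vec}(D) = S_D\theta_D + s_D$ for $\theta_D$ an ($n_D \times 1$) vector. It then follows from
Proposition 7.4 that

$$\sqrt{T}(\hat{\theta}_{B,T} - \theta_B) \xrightarrow{L} N\left(0, 2G_B D_n^+(\Omega \otimes \Omega)(D_n^+)'G_B'\right) \qquad [11.7.7]$$

$$\sqrt{T}(\hat{\theta}_{D,T} - \theta_D) \xrightarrow{L} N\left(0, 2G_D D_n^+(\Omega \otimes \Omega)(D_n^+)'G_D'\right), \qquad [11.7.8]$$

where

$$\underset{(n_B \times n^*)}{G_B} = \frac{\partial \theta_B}{\partial [\operatorname{vech}(\Omega)]'} \qquad [11.7.9]$$

$$\underset{(n_D \times n^*)}{G_D} = \frac{\partial \theta_D}{\partial [\operatorname{vech}(\Omega)]'} \qquad [11.7.10]$$

and $n^* = n(n + 1)/2$.

Equation [11.6.38] gave an expression for the [$n^* \times (n_B + n_D)$] matrix:

$$J = \begin{bmatrix} \dfrac{\partial \operatorname{vech}(\Omega)}{\partial \theta_B'} & \dfrac{\partial \operatorname{vech}(\Omega)}{\partial \theta_D'} \end{bmatrix}.$$

We noted there that if the model is to be identified, the columns of this matrix
must be linearly independent. In the just-identified case, $n^* = (n_B + n_D)$ and $J^{-1}$
exists, from which

$$\begin{bmatrix} G_B \\ G_D \end{bmatrix} = J^{-1}. \qquad [11.7.11]$$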


Standard Errors for Orthogonalized 
Impulse-Response Functions 


Section 11.6 described calculation of the following ($n \times n$) matrix:

$$H_s = \Psi_s B_0^{-1}. \qquad [11.7.12]$$

The row $i$, column $j$ element of this matrix measures the effect of the $j$th structural
disturbance ($u_{jt}$) on the $i$th variable in the system ($y_{i,t+s}$) after a lag of $s$ periods.
Collect these magnitudes in an ($n^2 \times 1$) vector $h_s \equiv \operatorname{vec}(H_s')$. Thus, the first $n$
elements of $h_s$ give the effect of $u_t$ on $y_{1,t+s}$, the next $n$ elements give the effect of
$u_t$ on $y_{2,t+s}$, and so on.

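Since $\Psi_s$ is a function of $\pi$ and since $B_0$ is a function of $\operatorname{vech}(\Omega)$, the
distributions of both the autoregressive coefficients and the variances affect the
asymptotic distribution of $\hat{h}_s$. It follows from Proposition 11.2 that with Gaussian
innovations,

$$\sqrt{T}(\hat{h}_{s,T} - h_s) \xrightarrow{L} N\left(0, \begin{bmatrix} \Xi_\pi & \Xi_\Omega \end{bmatrix} \begin{bmatrix} \Omega \otimes Q^{-1} & 0 \\ 0 & 2D_n^+(\Omega \otimes \Omega)(D_n^+)' \end{bmatrix} \begin{bmatrix} \Xi_\pi' \\ \Xi_\Omega' \end{bmatrix}\right)$$

$$\xrightarrow{L} N\left(0, [\Xi_\pi(\Omega \otimes Q^{-1})\Xi_\pi' + 2\Xi_\Omega D_n^+(\Omega \otimes \Omega)(D_n^+)'\Xi_\Omega']\right), \qquad [11.7.13]$$

where Appendix 11.B demonstrates that

$$\Xi_\pi \equiv \partial h_s/\partial \pi' = [I_n \otimes (B_0')^{-1}]G_s \qquad [11.7.14]$$

$$\Xi_\Omega \equiv \frac{\partial h_s}{\partial [\operatorname{vech}(\Omega)]'} = -[H_s \otimes (B_0')^{-1}]S_{B'}G_B. \qquad [11.7.15]$$

Here $G_s$ is the matrix given in [11.7.5], $G_B$ is the matrix given in [11.7.11], and $S_{B'}$
is an ($n^2 \times n_B$) matrix that takes the elements of $\theta_B$ and puts them in the
corresponding position to construct $\operatorname{vec}(B_0')$:

$$\operatorname{vec}(B_0') = S_{B'}\theta_B + s_{B'}.$$

For the supply-and-demand examples of [11.6.24] to [11.6.26],

$$S_{B'} = \begin{bmatrix} 0 & 0 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & -1 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}.$$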


Practical Experience with Standard Errors 


In practice, the standard errors for dynamic inferences based on VARs often
turn out to be disappointingly large (see Runkle, 1987, and Lütkepohl, 1990).




Although a VAR imposes few restrictions on the dynamics, the cost of this generality 
is that the inferences drawn are not too precise. To gain more precision, it is 
necessary to impose further restrictions. One approach is to fit the multivariate 
dynamics using a restricted model with far fewer parameters, provided that the 
data allow us to accept the restrictions. A second approach is to place greater 
reliance on prior expectations about the system dynamics. This second approach 
is explored in the next chapter. 


APPENDIX 11.A. Proofs of Chapter 11 Propositions 


■ Proof of Proposition 11.1. The condition on the roots of [11.1.35] ensures that the MA($\infty$)
representation is absolutely summable. Thus $y_t$ is ergodic for first moments, from
Propositions 10.2(b) and 10.5(a), and is also ergodic for second moments, from Proposition 10.2(d).
This establishes result 11.1(a).

The proofs of results (b) and (c) are virtually identical to those for a single OLS
regression with stochastic regressors (results [8.2.5] and [8.2.12]).

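To verify result (d), notice that

$$\sqrt{T}(\hat{\pi}_{i,T} - \pi_i) = \left[(1/T)\sum_{t=1}^{T} x_t x_t'\right]^{-1}\left[(1/\sqrt{T})\sum_{t=1}^{T} x_t\varepsilon_{it}\right]$$

and so

$$\sqrt{T}(\hat{\pi}_T - \pi) = \begin{bmatrix} Q_T^{-1}(1/\sqrt{T})\sum_{t=1}^{T} x_t\varepsilon_{1t} \\ Q_T^{-1}(1/\sqrt{T})\sum_{t=1}^{T} x_t\varepsilon_{2t} \\ \vdots \\ Q_T^{-1}(1/\sqrt{T})\sum_{t=1}^{T} x_t\varepsilon_{nt} \end{bmatrix}, \qquad [11.A.1]$$

where

$$Q_T \equiv \left[(1/T)\sum_{t=1}^{T} x_t x_t'\right].$$

Define $\xi_t$ to be the following ($nk \times 1$) vector:

$$\xi_t \equiv \begin{bmatrix} x_t\varepsilon_{1t} \\ x_t\varepsilon_{2t} \\ \vdots \\ x_t\varepsilon_{nt} \end{bmatrix}.$$

Notice that $\xi_t$ is a martingale difference sequence with finite fourth moments and variance

$$E(\xi_t\xi_t') = \begin{bmatrix} E(x_tx_t')\cdot E(\varepsilon_{1t}^2) & E(x_tx_t')\cdot E(\varepsilon_{1t}\varepsilon_{2t}) & \cdots & E(x_tx_t')\cdot E(\varepsilon_{1t}\varepsilon_{nt}) \\ E(x_tx_t')\cdot E(\varepsilon_{2t}\varepsilon_{1t}) & E(x_tx_t')\cdot E(\varepsilon_{2t}^2) & \cdots & E(x_tx_t')\cdot E(\varepsilon_{2t}\varepsilon_{nt}) \\ \vdots & \vdots & & \vdots \\ E(x_tx_t')\cdot E(\varepsilon_{nt}\varepsilon_{1t}) & E(x_tx_t')\cdot E(\varepsilon_{nt}\varepsilon_{2t}) & \cdots & E(x_tx_t')\cdot E(\varepsilon_{nt}^2) \end{bmatrix}$$

$$= \begin{bmatrix} E(\varepsilon_{1t}^2) & E(\varepsilon_{1t}\varepsilon_{2t}) & \cdots & E(\varepsilon_{1t}\varepsilon_{nt}) \\ E(\varepsilon_{2t}\varepsilon_{1t}) & E(\varepsilon_{2t}^2) & \cdots & E(\varepsilon_{2t}\varepsilon_{nt}) \\ \vdots & \vdots & & \vdots \\ E(\varepsilon_{nt}\varepsilon_{1t}) & E(\varepsilon_{nt}\varepsilon_{2t}) & \cdots & E(\varepsilon_{nt}^2) \end{bmatrix} \otimes E(x_tx_t') = \Omega \otimes Q.$$

It can further be shown that

$$(1/T)\sum_{t=1}^{T} \xi_t\xi_t' \xrightarrow{p} \Omega \otimes Q \qquad [11.A.2]$$

(see Exercise 11.1). It follows from Proposition 7.9 that

$$(1/\sqrt{T})\sum_{t=1}^{T} \xi_t \xrightarrow{L} N\left(0, (\Omega \otimes Q)\right). \qquad [11.A.3]$$

Now, expression [11.A.1] can be written

$$\sqrt{T}(\hat{\pi}_T - \pi) = \begin{bmatrix} Q_T^{-1} & 0 & \cdots & 0 \\ 0 & Q_T^{-1} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & Q_T^{-1} \end{bmatrix} \begin{bmatrix} (1/\sqrt{T})\sum_{t=1}^{T} x_t\varepsilon_{1t} \\ (1/\sqrt{T})\sum_{t=1}^{T} x_t\varepsilon_{2t} \\ \vdots \\ (1/\sqrt{T})\sum_{t=1}^{T} x_t\varepsilon_{nt} \end{bmatrix} = (I_n \otimes Q_T^{-1})(1/\sqrt{T})\sum_{t=1}^{T} \xi_t.$$

But result (a) implies that $Q_T^{-1} \xrightarrow{p} Q^{-1}$. Thus,

$$\sqrt{T}(\hat{\pi}_T - \pi) \xrightarrow{p} (I_n \otimes Q^{-1})(1/\sqrt{T})\sum_{t=1}^{T} \xi_t. \qquad [11.A.4]$$

But from [11.A.3], this has a distribution that is Gaussian with mean 0 and variance

$$(I_n \otimes Q^{-1})(\Omega \otimes Q)(I_n \otimes Q^{-1}) = (I_n\Omega I_n) \otimes (Q^{-1}QQ^{-1}) = \Omega \otimes Q^{-1},$$

as claimed. ■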
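■ Proof of Proposition 11.2. Define $\hat{\Omega}_T^* \equiv (1/T)\sum_{t=1}^{T} \varepsilon_t\varepsilon_t'$ to be the estimate of $\Omega$ based
on the true residuals. We first note that $\hat{\Omega}_T$ has the same asymptotic distribution as $\hat{\Omega}_T^*$.
To see this, observe that

$$\hat{\Omega}_T^* = (1/T)\sum_{t=1}^{T} (y_t - \Pi'x_t)(y_t - \Pi'x_t)'$$

$$= (1/T)\sum_{t=1}^{T} [y_t - \hat{\Pi}_T'x_t + (\hat{\Pi}_T - \Pi)'x_t][y_t - \hat{\Pi}_T'x_t + (\hat{\Pi}_T - \Pi)'x_t]'$$

$$= (1/T)\sum_{t=1}^{T} (y_t - \hat{\Pi}_T'x_t)(y_t - \hat{\Pi}_T'x_t)' + (\hat{\Pi}_T - \Pi)'(1/T)\sum_{t=1}^{T} x_tx_t'(\hat{\Pi}_T - \Pi) \qquad [11.A.5]$$

$$= \hat{\Omega}_T + (\hat{\Pi}_T - \Pi)'(1/T)\sum_{t=1}^{T} x_tx_t'(\hat{\Pi}_T - \Pi),$$

where cross-product terms were dropped in the third equality on the right in the light of
the OLS orthogonality condition $(1/T)\sum_{t=1}^{T}(y_t - \hat{\Pi}_T'x_t)x_t' = 0$. Equation [11.A.5] implies
that

$$\sqrt{T}(\hat{\Omega}_T^* - \hat{\Omega}_T) = (\hat{\Pi}_T - \Pi)'(1/T)\sum_{t=1}^{T} x_tx_t'[\sqrt{T}(\hat{\Pi}_T - \Pi)].$$

But Proposition 11.1 established that $(\hat{\Pi}_T - \Pi)' \xrightarrow{p} 0$, $(1/T)\sum_{t=1}^{T} x_tx_t' \xrightarrow{p} Q$, and
$\sqrt{T}(\hat{\Pi}_T - \Pi)$ converges in distribution. Thus, from Proposition 7.3, $\sqrt{T}(\hat{\Omega}_T^* - \hat{\Omega}_T) \xrightarrow{p} 0$,
meaning that $\sqrt{T}(\hat{\Omega}_T^* - \Omega) \xrightarrow{p} \sqrt{T}(\hat{\Omega}_T - \Omega)$.

Recalling [11.A.4],

$$\begin{bmatrix} \sqrt{T}[\hat{\pi}_T - \pi] \\ \sqrt{T}[\operatorname{vech}(\hat{\Omega}_T) - \operatorname{vech}(\Omega)] \end{bmatrix} \xrightarrow{p} \begin{bmatrix} (I_n \otimes Q^{-1})(1/\sqrt{T})\sum_{t=1}^{T} \xi_t \\ (1/\sqrt{T})\sum_{t=1}^{T} \lambda_t \end{bmatrix}, \qquad [11.A.6]$$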




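where $\xi_t = \varepsilon_t \otimes x_t$ and

$$\lambda_t \equiv \operatorname{vech}\begin{bmatrix} \varepsilon_{1t}^2 - \sigma_{11} & \varepsilon_{1t}\varepsilon_{2t} - \sigma_{12} & \cdots & \varepsilon_{1t}\varepsilon_{nt} - \sigma_{1n} \\ \varepsilon_{2t}\varepsilon_{1t} - \sigma_{21} & \varepsilon_{2t}^2 - \sigma_{22} & \cdots & \varepsilon_{2t}\varepsilon_{nt} - \sigma_{2n} \\ \vdots & \vdots & & \vdots \\ \varepsilon_{nt}\varepsilon_{1t} - \sigma_{n1} & \varepsilon_{nt}\varepsilon_{2t} - \sigma_{n2} & \cdots & \varepsilon_{nt}^2 - \sigma_{nn} \end{bmatrix}.$$

It is straightforward to show that $(\xi_t', \lambda_t')'$ is a martingale difference sequence that satisfies
the conditions of Proposition 7.9, from which

$$\begin{bmatrix} (1/\sqrt{T})\sum_{t=1}^{T} \xi_t \\ (1/\sqrt{T})\sum_{t=1}^{T} \lambda_t \end{bmatrix} \xrightarrow{L} N\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}\right), \qquad [11.A.7]$$

where

$$\begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} = \begin{bmatrix} E(\xi_t\xi_t') & E(\xi_t\lambda_t') \\ E(\lambda_t\xi_t') & E(\lambda_t\lambda_t') \end{bmatrix}.$$

Recall from the proof of Proposition 11.1 that

$$\Sigma_{11} = E(\xi_t\xi_t') = \Omega \otimes Q.$$

A typical element of $\Sigma_{12}$ is of the form

$$E(x_t\varepsilon_{it})(\varepsilon_{jt}\varepsilon_{lt} - \sigma_{jl}) = E(x_t)\cdot E(\varepsilon_{it}\varepsilon_{jt}\varepsilon_{lt}) - \sigma_{jl}\cdot E(x_t)\cdot E(\varepsilon_{it}),$$

which equals zero for all $i$, $j$, and $l$. Hence, [11.A.7] becomes

$$\begin{bmatrix} (1/\sqrt{T})\sum_{t=1}^{T} \xi_t \\ (1/\sqrt{T})\sum_{t=1}^{T} \lambda_t \end{bmatrix} \xrightarrow{L} N\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \Omega \otimes Q & 0 \\ 0 & \Sigma_{22} \end{bmatrix}\right),$$

and so, from [11.A.6],

$$\begin{bmatrix} \sqrt{T}[\hat{\pi}_T - \pi] \\ \sqrt{T}[\operatorname{vech}(\hat{\Omega}_T) - \operatorname{vech}(\Omega)] \end{bmatrix} \xrightarrow{L} N\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \Omega \otimes Q^{-1} & 0 \\ 0 & \Sigma_{22} \end{bmatrix}\right).$$

Hence, Proposition 11.2 will be established if we can show that $E(\lambda_t\lambda_t')$ is given by the
matrix $\Sigma_{22}$ described in the proposition; that is, we must show that

$$E(\varepsilon_{it}\varepsilon_{jt} - \sigma_{ij})(\varepsilon_{lt}\varepsilon_{mt} - \sigma_{lm}) = \sigma_{il}\sigma_{jm} + \sigma_{im}\sigma_{jl} \qquad [11.A.8]$$

for all $i$, $j$, $l$, and $m$.

To derive [11.A.8], let $\Omega = PP'$ denote the Cholesky decomposition of $\Omega$, and define

$$v_t \equiv P^{-1}\varepsilon_t. \qquad [11.A.9]$$

Then $E(v_tv_t') = P^{-1}\Omega(P^{-1})' = I_n$. Thus, $v_{it}$ is Gaussian with zero mean, unit variance, and
fourth moment given by $E(v_{it}^4) = 3$. Moreover, $v_{it}$ is independent of $v_{jt}$ for $i \neq j$.
Equation [11.A.9] implies

$$\varepsilon_t = Pv_t. \qquad [11.A.10]$$

Let $p_{ij}$ denote the row $i$, column $j$ element of $P$. Then the $i$th row of [11.A.10] states that

$$\varepsilon_{it} = p_{i1}v_{1t} + p_{i2}v_{2t} + \cdots + p_{in}v_{nt} \qquad [11.A.11]$$

and

$$\varepsilon_{it}\varepsilon_{jt} = (p_{i1}v_{1t} + p_{i2}v_{2t} + \cdots + p_{in}v_{nt}) \times (p_{j1}v_{1t} + p_{j2}v_{2t} + \cdots + p_{jn}v_{nt}). \qquad [11.A.12]$$

Second moments of $\varepsilon_t$ can be found by taking expectations of [11.A.12], recalling that
$E(v_{it}v_{jt}) = 1$ if $i = j$ and is zero otherwise:

$$E(\varepsilon_{it}\varepsilon_{jt}) = p_{i1}p_{j1} + p_{i2}p_{j2} + \cdots + p_{in}p_{jn}. \qquad [11.A.13]$$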




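Similarly, fourth moments can be found from

$$\begin{aligned}
E(\varepsilon_{it}\varepsilon_{jt}\varepsilon_{lt}\varepsilon_{mt}) &= E[(p_{i1}v_{1t} + p_{i2}v_{2t} + \cdots + p_{in}v_{nt})(p_{j1}v_{1t} + p_{j2}v_{2t} + \cdots + p_{jn}v_{nt}) \\
&\qquad \times (p_{l1}v_{1t} + p_{l2}v_{2t} + \cdots + p_{ln}v_{nt})(p_{m1}v_{1t} + p_{m2}v_{2t} + \cdots + p_{mn}v_{nt})] \\
&= [3(p_{i1}p_{j1}p_{l1}p_{m1} + p_{i2}p_{j2}p_{l2}p_{m2} + \cdots + p_{in}p_{jn}p_{ln}p_{mn})] \\
&\quad + [(p_{i1}p_{j1})(p_{l2}p_{m2} + p_{l3}p_{m3} + \cdots + p_{ln}p_{mn}) \\
&\qquad + (p_{i2}p_{j2})(p_{l1}p_{m1} + p_{l3}p_{m3} + \cdots + p_{ln}p_{mn}) + \cdots \\
&\qquad + (p_{in}p_{jn})(p_{l1}p_{m1} + p_{l2}p_{m2} + \cdots + p_{l,n-1}p_{m,n-1})] \\
&\quad + [(p_{i1}p_{l1})(p_{j2}p_{m2} + p_{j3}p_{m3} + \cdots + p_{jn}p_{mn}) \\
&\qquad + (p_{i2}p_{l2})(p_{j1}p_{m1} + p_{j3}p_{m3} + \cdots + p_{jn}p_{mn}) + \cdots \\
&\qquad + (p_{in}p_{ln})(p_{j1}p_{m1} + p_{j2}p_{m2} + \cdots + p_{j,n-1}p_{m,n-1})] \\
&\quad + [(p_{i1}p_{m1})(p_{j2}p_{l2} + p_{j3}p_{l3} + \cdots + p_{jn}p_{ln}) \\
&\qquad + (p_{i2}p_{m2})(p_{j1}p_{l1} + p_{j3}p_{l3} + \cdots + p_{jn}p_{ln}) + \cdots \\
&\qquad + (p_{in}p_{mn})(p_{j1}p_{l1} + p_{j2}p_{l2} + \cdots + p_{j,n-1}p_{l,n-1})] \\
&= [(p_{i1}p_{j1} + p_{i2}p_{j2} + \cdots + p_{in}p_{jn})(p_{l1}p_{m1} + p_{l2}p_{m2} + \cdots + p_{ln}p_{mn})] \\
&\quad + [(p_{i1}p_{l1} + p_{i2}p_{l2} + \cdots + p_{in}p_{ln})(p_{j1}p_{m1} + p_{j2}p_{m2} + \cdots + p_{jn}p_{mn})] \\
&\quad + [(p_{i1}p_{m1} + p_{i2}p_{m2} + \cdots + p_{in}p_{mn})(p_{j1}p_{l1} + p_{j2}p_{l2} + \cdots + p_{jn}p_{ln})] \\
&= \sigma_{ij}\sigma_{lm} + \sigma_{il}\sigma_{jm} + \sigma_{im}\sigma_{jl},
\end{aligned} \qquad [11.A.14]$$

where the last line follows from [11.A.13]. Then

$$E[(\varepsilon_{it}\varepsilon_{jt} - \sigma_{ij})(\varepsilon_{lt}\varepsilon_{mt} - \sigma_{lm})] = E(\varepsilon_{it}\varepsilon_{jt}\varepsilon_{lt}\varepsilon_{mt}) - \sigma_{ij}\sigma_{lm} = \sigma_{il}\sigma_{jm} + \sigma_{im}\sigma_{jl},$$

as claimed in [11.A.8]. ■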
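The algebra in [11.A.14] can be checked numerically for a particular covariance matrix. The sketch below is only an illustration: the matrix `Omega` is an arbitrary positive definite example, and the fourth moments of the independent standard normal variables $v_{it}$ are entered directly from the facts used in the proof ($E(v_{it}^4) = 3$, $E(v_{it}^2v_{jt}^2) = 1$ for $i \neq j$, and zero whenever some index appears an odd number of times).

```python
import itertools
import numpy as np

# Arbitrary positive definite covariance matrix, for illustration only.
Omega = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
P = np.linalg.cholesky(Omega)            # lower triangular factor with Omega = P P'
n = Omega.shape[0]

# E(v_a v_b v_c v_d) for independent N(0, 1) variables.
Ev4 = np.zeros((n, n, n, n))
for a, b, c, d in itertools.product(range(n), repeat=4):
    idx = sorted((a, b, c, d))
    if idx[0] == idx[3]:
        Ev4[a, b, c, d] = 3.0            # all four indices equal: E(v^4) = 3
    elif idx[0] == idx[1] and idx[2] == idx[3]:
        Ev4[a, b, c, d] = 1.0            # two distinct pairs: E(v_a^2) E(v_b^2) = 1

# Left side of [11.A.14]: expand eps = P v and take expectations term by term.
fourth = np.einsum("ia,jb,lc,md,abcd->ijlm", P, P, P, P, Ev4)

# Right side of [11.A.14].
rhs = (np.einsum("ij,lm->ijlm", Omega, Omega)
       + np.einsum("il,jm->ijlm", Omega, Omega)
       + np.einsum("im,jl->ijlm", Omega, Omega))

# Covariance in [11.A.8]: subtract sigma_ij * sigma_lm from the fourth moment.
cov_lambda = fourth - np.einsum("ij,lm->ijlm", Omega, Omega)
print(np.abs(fourth - rhs).max())
```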


■ Proof of Proposition 11.3. First suppose that $y$ fails to Granger-cause $x$, so that the
process can be written as in [11.2.4]. Define $v_{2t}$ to be the residual from a projection of $\varepsilon_{2t}$
on $\varepsilon_{1t}$, with $b_0$ defined to be the projection coefficient:

$$v_{2t} \equiv \varepsilon_{2t} - b_0\varepsilon_{1t}.$$

Thus, $v_{2t}$ and $\varepsilon_{1t}$ are uncorrelated, and, recalling that $\varepsilon_t$ is white noise, $v_{2t}$ must be uncorrelated
with $\varepsilon_{1\tau}$ for all $t \neq \tau$ as well. From the first row of [11.2.4], this means that $v_{2t}$ and $x_\tau$ are
uncorrelated for all $t$ and $\tau$. With this definition of $v_{2t}$, the second row of [11.2.4] can be
written as

$$y_t = \mu_2 + \psi_{21}(L)\varepsilon_{1t} + \psi_{22}(L)[v_{2t} + b_0\varepsilon_{1t}]. \qquad [11.A.15]$$

Furthermore, from the first row of [11.2.4],

$$\varepsilon_{1t} = [\psi_{11}(L)]^{-1}(x_t - \mu_1). \qquad [11.A.16]$$

Substituting [11.A.16] into [11.A.15] gives

$$y_t = c + b(L)x_t + \eta_t, \qquad [11.A.17]$$

where we have defined $b(L) \equiv \{[\psi_{21}(L) + b_0\psi_{22}(L)][\psi_{11}(L)]^{-1}\}$, $c \equiv \mu_2 - b(1)\mu_1$, and
$\eta_t \equiv \psi_{22}(L)v_{2t}$. But $\eta_t$, being constructed from $v_{2t}$, is uncorrelated with $x_\tau$ for all $\tau$.
Furthermore, only current and lagged values of $x$, as summarized by the operator $b(L)$, appear
in equation [11.A.17]. We have thus shown that if [11.2.4] holds, then $d_j = 0$ for all $j$ in
[11.2.5].

To prove the converse, suppose that $d_j = 0$ for all $j$ in [11.2.5]. Let

$$x_t = \mu_1 + \psi_{11}(L)\varepsilon_{1t} \qquad [11.A.18]$$

denote the univariate Wold representation for $x_t$; thus, $\psi_{11}^{(0)} = 1$. We will be using notation
consistent with the form of [11.2.4] in anticipation of the final answer that will be derived;
for now, the reader should view [11.A.18] as a new definition of $\psi_{11}(L)$ in terms of the



univariate Wold representation for $x$. There also exists a univariate Wold representation
for the error term in [11.2.5], denoted

$$\eta_t = \psi_{22}(L)v_{2t}, \qquad [11.A.19]$$

with $\psi_{22}^{(0)} = 1$. Notice that $\eta_t$ as defined in [11.2.5] is uncorrelated with $x_s$ for all $t$ and $s$. It
follows that $v_{2t}$ is uncorrelated with $x_\tau$ or $\varepsilon_{1\tau}$ for all $t$ and $\tau$.

Substituting [11.A.18] and [11.A.19] into [11.2.5],

$$y_t = c + b(1)\mu_1 + b(L)\psi_{11}(L)\varepsilon_{1t} + \psi_{22}(L)v_{2t}. \qquad [11.A.20]$$

Define

$$\varepsilon_{2t} \equiv v_{2t} + b_0\varepsilon_{1t}, \qquad [11.A.21]$$

for $b_0$ the coefficient on $L^0$ of $b(L)$ and

$$\mu_2 \equiv c + b(1)\mu_1. \qquad [11.A.22]$$

Observe that $(\varepsilon_{1t}, \varepsilon_{2t})'$ is vector white noise. Substituting [11.A.21] and [11.A.22] into
[11.A.20] produces

$$y_t = \mu_2 + [b(L)\psi_{11}(L) - b_0\psi_{22}(L)]\varepsilon_{1t} + \psi_{22}(L)\varepsilon_{2t}. \qquad [11.A.23]$$

Finally, define

$$\psi_{21}(L) \equiv [b(L)\psi_{11}(L) - b_0\psi_{22}(L)],$$

noting that $\psi_{21}^{(0)} = 0$. Then, substituting this into [11.A.23] produces

$$y_t = \mu_2 + \psi_{21}(L)\varepsilon_{1t} + \psi_{22}(L)\varepsilon_{2t}.$$

This combined with [11.A.18] completes the demonstration that [11.2.5] implies [11.2.4]. ■


APPENDIX 11.B. Calculation of Analytic Derivatives 


This appendix calculates the derivatives reported in Sections 11.6 and 11.7. 


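■ Derivation of [11.6.38]. Let the scalar $\xi$ represent some particular element of $\theta_B$ or $\theta_D$,
and let $\partial\Omega/\partial\xi$ denote the ($n \times n$) matrix that results when each element of $\Omega$ is
differentiated with respect to $\xi$. Thus, differentiating [11.6.33] with respect to $\xi$ results in

$$\partial\Omega/\partial\xi = (\partial B_0^{-1}/\partial\xi)D(B_0^{-1})' + B_0^{-1}(\partial D/\partial\xi)(B_0^{-1})' + (B_0^{-1})D[\partial(B_0^{-1})'/\partial\xi]. \qquad [11.B.1]$$

Define

$$\chi \equiv (\partial B_0^{-1}/\partial\xi)D(B_0^{-1})' \qquad [11.B.2]$$

and notice that

$$\chi' = (B_0^{-1})D[\partial(B_0^{-1})'/\partial\xi],$$

since $D$ is a variance-covariance matrix and must therefore be symmetric. Thus, [11.B.1]
can be written

$$\partial\Omega/\partial\xi = \chi + B_0^{-1}(\partial D/\partial\xi)(B_0^{-1})' + \chi'. \qquad [11.B.3]$$

Recall from Proposition 10.4 that

$$\operatorname{vec}(ABC) = (C' \otimes A)\cdot\operatorname{vec}(B). \qquad [11.B.4]$$

Thus, if the vec operator is applied to [11.B.3], the result is

$$\frac{\partial\operatorname{vec}(\Omega)}{\partial\xi} = \operatorname{vec}(\chi + \chi') + [(B_0^{-1}) \otimes (B_0^{-1})]\operatorname{vec}(\partial D/\partial\xi). \qquad [11.B.5]$$

Let $D_n$ denote the ($n^2 \times n^*$) duplication matrix introduced in [11.1.43]. Notice that
for any ($n \times n$) matrix $\chi$, the elements of $D_n'\operatorname{vec}(\chi)$ are of the form $\chi_{ii}$ for diagonal elements
of $\chi$ and of the form $(\chi_{ij} + \chi_{ji})$ for off-diagonal elements. Hence, $D_n'\operatorname{vec}(\chi) = D_n'\operatorname{vec}(\chi')$.
If [11.B.5] is premultiplied by $D_n^+ = (D_n'D_n)^{-1}D_n'$, the result is thus

$$\frac{\partial\operatorname{vech}(\Omega)}{\partial\xi} = 2D_n^+\operatorname{vec}(\chi) + D_n^+[(B_0^{-1}) \otimes (B_0^{-1})]\operatorname{vec}(\partial D/\partial\xi), \qquad [11.B.6]$$

since from [11.1.46] $D_n^+\operatorname{vec}(\Omega) = \operatorname{vech}(\Omega)$.

Differentiating the identity $B_0^{-1}B_0 = I_n$ with respect to $\xi$ produces

$$(\partial B_0^{-1}/\partial\xi)B_0 + B_0^{-1}(\partial B_0/\partial\xi) = 0_{nn},$$

or

$$\partial B_0^{-1}/\partial\xi = -B_0^{-1}(\partial B_0/\partial\xi)B_0^{-1}. \qquad [11.B.7]$$

Thus, [11.B.2] can be written

$$\chi = -B_0^{-1}(\partial B_0/\partial\xi)B_0^{-1}D(B_0^{-1})' = -B_0^{-1}(\partial B_0/\partial\xi)\Omega.$$

Applying the vec operator as in [11.B.4] results in

$$\operatorname{vec}(\chi) = -(\Omega \otimes B_0^{-1})\frac{\partial\operatorname{vec}(B_0)}{\partial\xi}.$$

Substituting this expression into [11.B.6] gives

$$\begin{aligned}
\frac{\partial\operatorname{vech}(\Omega)}{\partial\xi} &= -2D_n^+(\Omega \otimes B_0^{-1})\frac{\partial\operatorname{vec}(B_0)}{\partial\xi} + D_n^+[(B_0^{-1}) \otimes (B_0^{-1})]\frac{\partial\operatorname{vec}(D)}{\partial\xi} \\
&= -2D_n^+(\Omega \otimes B_0^{-1})S_B\frac{\partial\theta_B}{\partial\xi} + D_n^+[(B_0^{-1}) \otimes (B_0^{-1})]S_D\frac{\partial\theta_D}{\partial\xi}.
\end{aligned} \qquad [11.B.8]$$

Expression [11.B.8] is an ($n^* \times 1$) vector that gives the effect of a change in some
element of $\theta_B$ or $\theta_D$ on each of the $n^*$ elements of $\operatorname{vech}(\Omega)$. If $\xi$ corresponds to the first
element of $\theta_B$, then $\partial\theta_B/\partial\xi = e_1$, the first column of the ($n_B \times n_B$) identity matrix, and
$\partial\theta_D/\partial\xi = 0$. If $\xi$ corresponds to the second element of $\theta_B$, then $\partial\theta_B/\partial\xi = e_2$. If we stack
the vectors in [11.B.8] associated with $\xi = \theta_{B,1}, \xi = \theta_{B,2}, \ldots, \xi = \theta_{B,n_B}$ side by side, the
result is

$$\begin{bmatrix} \dfrac{\partial\operatorname{vech}(\Omega)}{\partial\theta_{B,1}} & \dfrac{\partial\operatorname{vech}(\Omega)}{\partial\theta_{B,2}} & \cdots & \dfrac{\partial\operatorname{vech}(\Omega)}{\partial\theta_{B,n_B}} \end{bmatrix} = [-2D_n^+(\Omega \otimes B_0^{-1})S_B][e_1 \;\; e_2 \;\; \cdots \;\; e_{n_B}]. \qquad [11.B.9]$$

That is,

$$\frac{\partial\operatorname{vech}(\Omega)}{\partial\theta_B'} = [-2D_n^+(\Omega \otimes B_0^{-1})S_B]. \qquad [11.B.10]$$

Similarly, letting the scalar $\xi$ in [11.B.8] correspond to each of the elements of $\theta_D$ in succession
and stacking the resulting columns horizontally results in

$$\frac{\partial\operatorname{vech}(\Omega)}{\partial\theta_D'} = D_n^+[(B_0^{-1}) \otimes (B_0^{-1})]S_D. \qquad [11.B.11]$$

Equation [11.6.38] then follows immediately from [11.B.10] and [11.B.11]. ■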
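One way to gain confidence in [11.B.10] and [11.B.11] is to compare them with finite differences of $\operatorname{vech}(\Omega)$. The sketch below does this for the supply-and-demand form of $B_0$; the parameter values, the step size and the helper names are assumptions made for the illustration.

```python
import numpy as np

def vech(A):
    """Stack the lower triangle of A column by column."""
    n = A.shape[0]
    return np.concatenate([A[j:, j] for j in range(n)])

def omega(theta):
    """Omega and B_0^{-1} for theta = (beta, gamma, h, var_d, var_s, var_w)."""
    beta, gamma, h, vd, vs, vw = theta
    B0 = np.array([[1.0, -beta, 0.0], [1.0, -gamma, -h], [0.0, 0.0, 1.0]])
    B_inv = np.linalg.inv(B0)
    return B_inv @ np.diag([vd, vs, vw]) @ B_inv.T, B_inv

n = 3
theta0 = np.array([-1.0, 1.0, 0.5, 1.0, 2.0, 3.0])    # illustrative values
Omega0, B_inv = omega(theta0)

# D_n^+ built directly: the row for (i, j), i >= j, averages vec positions (i, j) and (j, i).
pairs = [(i, j) for j in range(n) for i in range(j, n)]
D_plus = np.zeros((len(pairs), n * n))
for r, (i, j) in enumerate(pairs):
    D_plus[r, j * n + i] += 0.5
    D_plus[r, i * n + j] += 0.5

# Selection matrices for vec(B_0) and vec(D) (column-major vec).
S_B = np.zeros((9, 3))
S_B[3, 0] = S_B[4, 1] = S_B[7, 2] = -1.0
S_D = np.zeros((9, 3))
S_D[0, 0] = S_D[4, 1] = S_D[8, 2] = 1.0

# Analytic derivatives [11.B.10] and [11.B.11], side by side as in [11.6.38].
J = np.hstack([-2.0 * D_plus @ np.kron(Omega0, B_inv) @ S_B,
               D_plus @ np.kron(B_inv, B_inv) @ S_D])

# Forward differences of vech(Omega), one parameter at a time.
eps = 1e-6
J_fd = np.column_stack([(vech(omega(theta0 + eps * e)[0]) - vech(Omega0)) / eps
                        for e in np.eye(6)])
max_gap = np.abs(J - J_fd).max()
print(max_gap)
```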


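■ Derivation of [11.7.4]. Recall from equation [10.1.19] that

$$\Psi_s = \Phi_1\Psi_{s-1} + \Phi_2\Psi_{s-2} + \cdots + \Phi_p\Psi_{s-p}. \qquad [11.B.12]$$

Taking transposes,

$$\Psi_s' = \Psi_{s-1}'\Phi_1' + \Psi_{s-2}'\Phi_2' + \cdots + \Psi_{s-p}'\Phi_p'. \qquad [11.B.13]$$

Let the scalar $\xi$ denote some particular element of $\pi$, and differentiate [11.B.13] with
respect to $\xi$:

$$\begin{aligned}
\frac{\partial\Psi_s'}{\partial\xi} &= \Psi_{s-1}'\frac{\partial\Phi_1'}{\partial\xi} + \Psi_{s-2}'\frac{\partial\Phi_2'}{\partial\xi} + \cdots + \Psi_{s-p}'\frac{\partial\Phi_p'}{\partial\xi} + \frac{\partial\Psi_{s-1}'}{\partial\xi}\Phi_1' + \frac{\partial\Psi_{s-2}'}{\partial\xi}\Phi_2' + \cdots + \frac{\partial\Psi_{s-p}'}{\partial\xi}\Phi_p' \\
&= [0_{n1} \;\; \Psi_{s-1}' \;\; \Psi_{s-2}' \;\; \cdots \;\; \Psi_{s-p}'] \begin{bmatrix} \partial c'/\partial\xi \\ \partial\Phi_1'/\partial\xi \\ \partial\Phi_2'/\partial\xi \\ \vdots \\ \partial\Phi_p'/\partial\xi \end{bmatrix} + \frac{\partial\Psi_{s-1}'}{\partial\xi}\Phi_1' + \frac{\partial\Psi_{s-2}'}{\partial\xi}\Phi_2' + \cdots + \frac{\partial\Psi_{s-p}'}{\partial\xi}\Phi_p' \\
&= [0_{n1} \;\; \Psi_{s-1}' \;\; \Psi_{s-2}' \;\; \cdots \;\; \Psi_{s-p}']\frac{\partial\Pi}{\partial\xi} + \frac{\partial\Psi_{s-1}'}{\partial\xi}\Phi_1' + \frac{\partial\Psi_{s-2}'}{\partial\xi}\Phi_2' + \cdots + \frac{\partial\Psi_{s-p}'}{\partial\xi}\Phi_p'.
\end{aligned} \qquad [11.B.14]$$

Recall result [11.B.4], and note the special case when $A$ is the ($n \times n$) identity matrix,
$B$ is an ($n \times r$) matrix, and $C$ is an ($r \times q$) matrix:

$$\operatorname{vec}(BC) = (C' \otimes I_n)\operatorname{vec}(B). \qquad [11.B.15]$$

For example,

$$\operatorname{vec}\left(\frac{\partial\Psi_{s-1}'}{\partial\xi}\Phi_1'\right) = (\Phi_1 \otimes I_n)\operatorname{vec}\left(\frac{\partial\Psi_{s-1}'}{\partial\xi}\right) = (\Phi_1 \otimes I_n)\left(\frac{\partial\psi_{s-1}}{\partial\xi}\right). \qquad [11.B.16]$$

Another implication of [11.B.4] can be obtained by letting $A$ be an ($m \times q$) matrix,
$B$ a ($q \times n$) matrix, and $C$ the ($n \times n$) identity matrix:

$$\operatorname{vec}(AB) = (I_n \otimes A)\operatorname{vec}(B). \qquad [11.B.17]$$

For example,

$$\begin{aligned}
\operatorname{vec}\left([0_{n1} \;\; \Psi_{s-1}' \;\; \Psi_{s-2}' \;\; \cdots \;\; \Psi_{s-p}']\frac{\partial\Pi}{\partial\xi}\right) &= [I_n \otimes (0_{n1} \;\; \Psi_{s-1}' \;\; \Psi_{s-2}' \;\; \cdots \;\; \Psi_{s-p}')]\operatorname{vec}\left(\frac{\partial\Pi}{\partial\xi}\right) \\
&= [I_n \otimes (0_{n1} \;\; \Psi_{s-1}' \;\; \Psi_{s-2}' \;\; \cdots \;\; \Psi_{s-p}')]\left(\frac{\partial\pi}{\partial\xi}\right).
\end{aligned} \qquad [11.B.18]$$

Applying the vec operator to [11.B.14] and using [11.B.18] and [11.B.16] gives

$$\frac{\partial\psi_s}{\partial\xi} = [I_n \otimes (0_{n1} \;\; \Psi_{s-1}' \;\; \Psi_{s-2}' \;\; \cdots \;\; \Psi_{s-p}')]\left(\frac{\partial\pi}{\partial\xi}\right) + (\Phi_1 \otimes I_n)\left(\frac{\partial\psi_{s-1}}{\partial\xi}\right) + (\Phi_2 \otimes I_n)\left(\frac{\partial\psi_{s-2}}{\partial\xi}\right) + \cdots + (\Phi_p \otimes I_n)\left(\frac{\partial\psi_{s-p}}{\partial\xi}\right). \qquad [11.B.19]$$

Letting $\xi$ successively represent each of the elements of $\pi$ and stacking the resulting
equations horizontally as in [11.B.9] results in

$$\frac{\partial\psi_s}{\partial\pi'} = [I_n \otimes (0_{n1} \;\; \Psi_{s-1}' \;\; \Psi_{s-2}' \;\; \cdots \;\; \Psi_{s-p}')] + (\Phi_1 \otimes I_n)\left[\frac{\partial\psi_{s-1}}{\partial\pi'}\right] + \cdots + (\Phi_p \otimes I_n)\left[\frac{\partial\psi_{s-p}}{\partial\pi'}\right],$$

as claimed in [11.7.4]. ■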


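■ Derivation of [11.7.5]. Here the task is to verify that if $G_s$ is given by [11.7.5], then
[11.7.4] holds:

$$G_s = [I_n \otimes (0_{n1} \;\; \Psi_{s-1}' \;\; \Psi_{s-2}' \;\; \cdots \;\; \Psi_{s-p}')] + \sum_{k=1}^{p} (\Phi_k \otimes I_n)G_{s-k}. \qquad [11.B.20]$$

Notice that for $G_s$ given by [11.7.5],

$$\begin{aligned}
\sum_{k=1}^{p} (\Phi_k \otimes I_n)G_{s-k} &= \sum_{k=1}^{p} (\Phi_k \otimes I_n)\sum_{i=1}^{s-k} [\Psi_{i-1} \otimes (0_{n1} \;\; \Psi_{s-k-i}' \;\; \Psi_{s-k-i-1}' \;\; \cdots \;\; \Psi_{s-k-i-p+1}')] \\
&= \sum_{k=1}^{p}\sum_{i=1}^{s-k} [\Phi_k\Psi_{i-1} \otimes (0_{n1} \;\; \Psi_{s-k-i}' \;\; \Psi_{s-k-i-1}' \;\; \cdots \;\; \Psi_{s-k-i-p+1}')].
\end{aligned}$$

For any given value for $k$ and $i$, define $v \equiv k + i$. When $i = 1$, then $v = k + 1$; when
$i = 2$, then $v = k + 2$; and so on:

$$\sum_{k=1}^{p} (\Phi_k \otimes I_n)G_{s-k} = \sum_{k=1}^{p}\sum_{v=k+1}^{s} [\Phi_k\Psi_{v-k-1} \otimes (0_{n1} \;\; \Psi_{s-v}' \;\; \Psi_{s-v-1}' \;\; \cdots \;\; \Psi_{s-v-p+1}')].$$

Recalling further that $\Psi_{v-k-1} = 0$ for $v = 2, 3, \ldots, k$, we could equally well write

$$\begin{aligned}
\sum_{k=1}^{p} (\Phi_k \otimes I_n)G_{s-k} &= \sum_{k=1}^{p}\sum_{v=2}^{s} [\Phi_k\Psi_{v-k-1} \otimes (0_{n1} \;\; \Psi_{s-v}' \;\; \Psi_{s-v-1}' \;\; \cdots \;\; \Psi_{s-v-p+1}')] \\
&= \sum_{v=2}^{s}\sum_{k=1}^{p} [\Phi_k\Psi_{v-k-1} \otimes (0_{n1} \;\; \Psi_{s-v}' \;\; \Psi_{s-v-1}' \;\; \cdots \;\; \Psi_{s-v-p+1}')] \\
&= \sum_{v=2}^{s} \left[\left(\sum_{k=1}^{p} \Phi_k\Psi_{v-k-1}\right) \otimes (0_{n1} \;\; \Psi_{s-v}' \;\; \Psi_{s-v-1}' \;\; \cdots \;\; \Psi_{s-v-p+1}')\right] \\
&= \sum_{v=2}^{s} [\Psi_{v-1} \otimes (0_{n1} \;\; \Psi_{s-v}' \;\; \Psi_{s-v-1}' \;\; \cdots \;\; \Psi_{s-v-p+1}')],
\end{aligned} \qquad [11.B.21]$$

by virtue of [11.B.12]. If the first term on the right side of [11.B.20] is added to [11.B.21],
the result is

$$\begin{aligned}
&[I_n \otimes (0_{n1} \;\; \Psi_{s-1}' \;\; \Psi_{s-2}' \;\; \cdots \;\; \Psi_{s-p}')] + \sum_{k=1}^{p} (\Phi_k \otimes I_n)G_{s-k} \\
&\quad = [I_n \otimes (0_{n1} \;\; \Psi_{s-1}' \;\; \Psi_{s-2}' \;\; \cdots \;\; \Psi_{s-p}')] + \sum_{v=2}^{s} [\Psi_{v-1} \otimes (0_{n1} \;\; \Psi_{s-v}' \;\; \Psi_{s-v-1}' \;\; \cdots \;\; \Psi_{s-v-p+1}')] \\
&\quad = \sum_{v=1}^{s} [\Psi_{v-1} \otimes (0_{n1} \;\; \Psi_{s-v}' \;\; \Psi_{s-v-1}' \;\; \cdots \;\; \Psi_{s-v-p+1}')],
\end{aligned}$$

which is indeed the expression for $G_s$ given in [11.7.5]. ■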


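■ Derivation of [11.7.14] and [11.7.15]. Postmultiplying [11.7.12] by $B_0$ and transposing
results in

$$B_0'H_s' = \Psi_s'. \qquad [11.B.22]$$

Let the scalar $\xi$ denote some element of $\pi$ or $\Omega$, and differentiate [11.B.22] with respect
to $\xi$:

$$(\partial B_0'/\partial\xi)H_s' + B_0'(\partial H_s'/\partial\xi) = \partial\Psi_s'/\partial\xi. \qquad [11.B.23]$$

Applying the vec operator to [11.B.23] and using [11.B.15] and [11.B.17],

$$(H_s \otimes I_n)(\partial\operatorname{vec}(B_0')/\partial\xi) + (I_n \otimes B_0')(\partial\operatorname{vec}(H_s')/\partial\xi) = \partial\operatorname{vec}(\Psi_s')/\partial\xi,$$

implying that

$$\begin{aligned}
\partial h_s/\partial\xi &= -(I_n \otimes B_0')^{-1}(H_s \otimes I_n)(\partial\operatorname{vec}(B_0')/\partial\xi) + (I_n \otimes B_0')^{-1}\,\partial\psi_s/\partial\xi \\
&= -[H_s \otimes (B_0')^{-1}](\partial\operatorname{vec}(B_0')/\partial\xi) + [I_n \otimes (B_0')^{-1}]\,\partial\psi_s/\partial\xi.
\end{aligned} \qquad [11.B.24]$$

Noticing that $B_0$ does not depend on $\pi$, if [11.B.24] is stacked horizontally for $\xi = \pi_1, \pi_2,
\ldots, \pi_{nk}$, the result is

$$\partial h_s/\partial\pi' = [I_n \otimes (B_0')^{-1}]\,\partial\psi_s/\partial\pi',$$

as claimed in [11.7.14]. Similarly, if $\xi$ is an element of $\Omega$, then $\xi$ has no effect on $\Psi_s$, and
its influence on $B_0'$ is given by

$$\frac{\partial\operatorname{vec}(B_0')}{\partial\xi} = S_{B'}\frac{\partial\theta_B}{\partial\xi}.$$

Stacking [11.B.24] horizontally with $\xi$ representing each of the elements of $\operatorname{vech}(\Omega)$ thus
produces

$$\frac{\partial h_s}{\partial[\operatorname{vech}(\Omega)]'} = -[H_s \otimes (B_0')^{-1}]S_{B'}\frac{\partial\theta_B}{\partial[\operatorname{vech}(\Omega)]'},$$

as claimed in [11.7.15]. ■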


Chapter 11 Exercises 


11.1. Verify result [11.A.2].

11.2. Consider the following three-variable VAR:

$$y_{1t} = \alpha y_{1,t-1} + \beta y_{2,t-1} + \varepsilon_{1t}$$
$$y_{2t} = \gamma y_{1,t-1} + \varepsilon_{2t}$$
$$y_{3t} = \xi y_{1,t-1} + \mu y_{2,t-1} + \eta y_{3,t-1} + \varepsilon_{3t}.$$

(a) Is $y_{1t}$ block-exogenous with respect to the vector $(y_{2t}, y_{3t})'$?
(b) Is the vector $(y_{1t}, y_{2t})$ block-exogenous with respect to $y_{3t}$?
(c) Is $y_{3t}$ block-exogenous with respect to the vector $(y_{1t}, y_{2t})'$?


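11.3. Consider the following bivariate VAR:

$$y_{1t} = \alpha_1y_{1,t-1} + \alpha_2y_{1,t-2} + \cdots + \alpha_py_{1,t-p} + \beta_1y_{2,t-1} + \beta_2y_{2,t-2} + \cdots + \beta_py_{2,t-p} + \varepsilon_{1t}$$
$$y_{2t} = \gamma_1y_{1,t-1} + \gamma_2y_{1,t-2} + \cdots + \gamma_py_{1,t-p} + \delta_1y_{2,t-1} + \delta_2y_{2,t-2} + \cdots + \delta_py_{2,t-p} + \varepsilon_{2t}$$

$$E(\varepsilon_t\varepsilon_\tau') = \begin{cases} \begin{bmatrix} \Omega_{11} & \Omega_{12} \\ \Omega_{21} & \Omega_{22} \end{bmatrix} & \text{for } t = \tau \\ 0 & \text{otherwise.} \end{cases}$$

Use the results of Section 11.3 to write this in the form

$$y_{1t} = \zeta_1y_{1,t-1} + \zeta_2y_{1,t-2} + \cdots + \zeta_py_{1,t-p} + \eta_1y_{2,t-1} + \eta_2y_{2,t-2} + \cdots + \eta_py_{2,t-p} + u_{1t}$$
$$y_{2t} = \lambda_0y_{1t} + \lambda_1y_{1,t-1} + \lambda_2y_{1,t-2} + \cdots + \lambda_py_{1,t-p} + \xi_1y_{2,t-1} + \xi_2y_{2,t-2} + \cdots + \xi_py_{2,t-p} + u_{2t},$$

where

$$E(u_tu_\tau') = \begin{cases} \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix} & \text{for } t = \tau \\ 0 & \text{otherwise.} \end{cases}$$

What is the relation between the parameters of the first representation ($\alpha_j, \beta_j, \gamma_j, \delta_j, \Omega_{ij}$)
and those of the second representation ($\zeta_j, \eta_j, \lambda_j, \xi_j, \sigma_j^2$)? What is the relation between $\varepsilon_t$
and $u_t$?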


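11.4. Write the result for Exercise 11.3 as

$$\begin{bmatrix} 1 - \zeta(L) & -\eta(L) \\ -\lambda_0 - \lambda(L) & 1 - \xi(L) \end{bmatrix} \begin{bmatrix} y_{1t} \\ y_{2t} \end{bmatrix} = \begin{bmatrix} u_{1t} \\ u_{2t} \end{bmatrix}$$

or

$$A(L)y_t = u_t.$$

Premultiply this system by the adjoint of $A(L)$,

$$A^*(L) \equiv \begin{bmatrix} 1 - \xi(L) & \eta(L) \\ \lambda_0 + \lambda(L) & 1 - \zeta(L) \end{bmatrix},$$

to deduce that $y_{1t}$ and $y_{2t}$ each admit a univariate ARMA($2p$, $p$) representation. Show how
the argument generalizes to establish that if the ($n \times 1$) vector $y_t$ follows a $p$th-order
autoregression, then each individual element $y_{it}$ follows an ARMA[$np$, $(n - 1)p$] process.
(See Zellner and Palm, 1974.)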


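11.5. Consider the following bivariate VAR:

$$\begin{aligned}
y_{1t} &= 0.3\,y_{1,t-1} + 0.8\,y_{2,t-1} + \varepsilon_{1t} \\
y_{2t} &= 0.9\,y_{1,t-1} + 0.4\,y_{2,t-1} + \varepsilon_{2t},
\end{aligned}$$

with $E(\varepsilon_{1t}\varepsilon_{1\tau}) = 1$ for $t = \tau$ and 0 otherwise,
$E(\varepsilon_{2t}\varepsilon_{2\tau}) = 2$ for $t = \tau$ and 0 otherwise, and
$E(\varepsilon_{1t}\varepsilon_{2\tau}) = 0$ for all $t$ and $\tau$.
(a) Is this system covariance-stationary?
(b) Calculate $\boldsymbol{\Psi}_s = \partial \mathbf{y}_{t+s}/\partial \boldsymbol{\varepsilon}_t'$ for s = 0, 1, and 2. What is the limit as $s \to \infty$?
(c) Calculate the fraction of the MSE of the two-period-ahead forecast error for
variable 1,

$$E[y_{1,t+2} - \hat{E}(y_{1,t+2} \mid \mathbf{y}_t, \mathbf{y}_{t-1}, \ldots)]^2,$$

that is due to $\varepsilon_{1,t+1}$ and $\varepsilon_{1,t+2}$.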


Chapter 11 References 


Ashley, Richard. 1988. “On the Relative Worth of Recent Macroeconomic Forecasts.”
International Journal of Forecasting 4:363-76.

Baillie, Richard T. 1987. “Inference in Dynamic Models Containing ‘Surprise’ Variables.”
Journal of Econometrics 35:101-17.

Bernanke, Ben. 1986. “Alternative Explanations of the Money-Income Correlation.”
Carnegie-Rochester Conference Series on Public Policy 25:49-100.

Blanchard, Olivier. 1989. “A Traditional Interpretation of Macroeconomic Fluctuations.”
American Economic Review 79:1146-64.

Blanchard, Olivier, and Peter Diamond. 1989. “The Beveridge Curve.” Brookings Papers on
Economic Activity 1:1989, 1-60.

Blanchard, Olivier, and Peter Diamond. 1990. “The Cyclical Behavior of the Gross Flows of
U.S. Workers.” Brookings Papers on Economic Activity 2:1990, 85-155.

Blanchard, Olivier, and Danny Quah. 1989. “The Dynamic Effects of Aggregate Demand and
Aggregate Supply Disturbances.” American Economic Review 79:655-73.

Blanchard, Olivier, and Mark Watson. 1986. “Are Business Cycles All Alike?” in Robert J.
Gordon, ed., The American Business Cycle. Chicago: University of Chicago Press.

Bouissou, M. B., J. J. Laffont, and Q. H. Vuong. 1986. “Tests of Noncausality under
Markov Assumptions for Qualitative Panel Data.” Econometrica 54:395-414.

Christiano, Lawrence J., and Lars Ljungqvist. 1988. “Money Does Granger-Cause Output
in the Bivariate Money-Output Relation.” Journal of Monetary Economics 22:217-35.

Cooley, Thomas F., and Stephen F. LeRoy. 1985. “Atheoretical Macroeconometrics: A
Critique.” Journal of Monetary Economics 16:283-308.

Fama, Eugene F. 1965. “The Behavior of Stock Market Prices.” Journal of Business
38:34-105.

Feige, Edgar L., and Douglas K. Pearce. 1979. “The Casual Causal Relationship between
Money and Income: Some Caveats for Time Series Analysis.” Review of Economics and
Statistics 61:521-33.


Flavin, Marjorie A. 1981. “The Adjustment of Consumption to Changing Expectations
about Future Income.” Journal of Political Economy 89:974-1009.

Geweke, John. 1982. “Measurement of Linear Dependence and Feedback between Multiple
Time Series.” Journal of the American Statistical Association 77:304-13.

Geweke, John, Richard Meese, and Warren Dent. 1983. “Comparing Alternative Tests of
Causality in Temporal Systems: Analytic Results and Experimental Evidence.” Journal of
Econometrics 21:161-94.

Giannini, Carlo. 1992. Topics in Structural VAR Econometrics. New York: Springer-Verlag.

Granger, C. W. J. 1969. “Investigating Causal Relations by Econometric Models and
Cross-Spectral Methods.” Econometrica 37:424-38.

Hamilton, James D. 1983. “Oil and the Macroeconomy since World War II.” Journal of
Political Economy 91:228-48.

Hamilton, James D. 1985. “Historical Causes of Postwar Oil Shocks and Recessions.”
Energy Journal 6:97-116.

Hansen, Lars P., and Thomas J. Sargent. 1981. “Formulating and Estimating Dynamic
Linear Rational Expectations Models,” in Robert E. Lucas, Jr., and Thomas J. Sargent,
eds., Rational Expectations and Econometric Practice, Vol. I. Minneapolis: University of
Minnesota Press.

Keating, John W. 1990. “Identifying VAR Models under Rational Expectations.” Journal
of Monetary Economics 25:453-76.

Leamer, Edward. 1985. “Vector Autoregressions for Causal Inference?” Carnegie-Rochester
Conference Series on Public Policy 22:255-303.

Lucas, Robert E., Jr. 1978. “Asset Prices in an Exchange Economy.” Econometrica
46:1429-45.

Lütkepohl, Helmut. 1989. “A Note on the Asymptotic Distribution of Impulse Response
Functions of Estimated VAR Models with Orthogonal Residuals.” Journal of Econometrics
42:371-76.

Lütkepohl, Helmut. 1990. “Asymptotic Distributions of Impulse Response Functions and
Forecast Error Variance Decompositions of Vector Autoregressive Models.” Review of
Economics and Statistics 72:116-25.

Magnus, Jan R. 1978. “Maximum Likelihood Estimation of the GLS Model with Unknown
Parameters in the Disturbance Covariance Matrix.” Journal of Econometrics 7:281-312.

Magnus, Jan R., and Heinz Neudecker. 1988. Matrix Differential Calculus with Applications
in Statistics and Econometrics. New York: Wiley.

Pierce, David A., and Larry D. Haugh. 1977. “Causality in Temporal Systems:
Characterization and a Survey.” Journal of Econometrics 5:265-93.

Rothenberg, Thomas J. 1971. “Identification in Parametric Models.” Econometrica
39:577-91.

Rothenberg, Thomas J. 1973. Efficient Estimation with a Priori Information. New Haven,
Conn.: Yale University Press.

Runkle, David E. 1987. “Vector Autoregressions and Reality.” Journal of Business and
Economic Statistics 5:437-42.

Shapiro, Matthew D., and Mark W. Watson. 1988. “Sources of Business Cycle Fluctuations,”
in Stanley Fischer, ed., NBER Macroeconomics Annual 1988. Cambridge, Mass.: MIT Press.

Sims, Christopher A. 1972. “Money, Income and Causality.” American Economic Review
62:540-52.

Sims, Christopher A. 1980. “Macroeconomics and Reality.” Econometrica 48:1-48.

Sims, Christopher A. 1986. “Are Forecasting Models Usable for Policy Analysis?” Quarterly
Review of the Federal Reserve Bank of Minneapolis (Winter), 2-16.

Stock, James H., and Mark W. Watson. 1989. “Interpreting the Evidence on Money-Income
Causality.” Journal of Econometrics 40:161-81.

Theil, Henri. 1971. Principles of Econometrics. New York: Wiley.

Zellner, Arnold. 1962. “An Efficient Method of Estimating Seemingly Unrelated
Regressions and Tests for Aggregation Bias.” Journal of the American Statistical
Association 57:348-68.

Zellner, Arnold, and Franz Palm. 1974. “Time Series Analysis and Simultaneous Equation
Econometric Models.” Journal of Econometrics 2:17-54.




12 


Bayesian Analysis 


The previous chapter noted that because so many parameters are estimated in a 
vector autoregression, the standard errors for inferences can be large. The estimates 
can be improved if the analyst has any information about the parameters beyond 
that contained in the sample. Bayesian estimation provides a convenient framework 
for incorporating prior information with as much weight as the analyst feels it 
merits. 

Section 12.1 introduces the basic principles underlying Bayesian analysis and 
uses them to analyze a standard regression model or univariate autoregression. 
Vector autoregressions are discussed in Section 12.2. For the specifications in 
Sections 12.1 and 12.2, the Bayesian estimators can be found analytically. Nu- 
merical methods that can be used to analyze more general statistical problems from 
a Bayesian framework are reviewed in Section 12.3. 


12.1. Introduction to Bayesian Analysis 


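Let $\boldsymbol{\theta}$ be an $(a \times 1)$ vector of parameters to be estimated from a sample of
observations. For example, if $y_t \sim \text{i.i.d. } N(\mu, \sigma^2)$, then
$\boldsymbol{\theta} = (\mu, \sigma^2)'$ is to be estimated on the basis of
$\mathbf{y} = (y_1, y_2, \ldots, y_T)'$. Much of the discussion up to this
point in the text has been based on the classical statistical perspective that there
exists some true value of $\boldsymbol{\theta}$. This true value is regarded as an unknown but fixed
number. An estimator $\hat{\boldsymbol{\theta}}$ is constructed from the data, and
$\hat{\boldsymbol{\theta}}$ is therefore a random variable. In classical statistics, the mean and
plim of the random variable $\hat{\boldsymbol{\theta}}$ are compared with the true value
$\boldsymbol{\theta}$. The efficiency of the estimator is judged by the mean squared error of the
random variable, $E(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})'$.
A popular classical estimator is the value $\hat{\boldsymbol{\theta}}$ that maximizes the sample
likelihood, which for this example would be

$$f(\mathbf{y}; \boldsymbol{\theta}) = \prod_{t=1}^{T} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[\frac{-(y_t - \mu)^2}{2\sigma^2}\right]. \qquad \text{[12.1.1]}$$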


In Bayesian statistics, by contrast, $\boldsymbol{\theta}$ itself is regarded as a random variable.
All inference about $\boldsymbol{\theta}$ takes the form of statements of probability, such as “there
is only a 0.05 probability that $\theta_1$ is greater than zero.” The view is that the analyst
will always have some uncertainty about $\boldsymbol{\theta}$, and the goal of statistical analysis is to
describe this uncertainty in terms of a probability distribution. Any information
the analyst had about $\boldsymbol{\theta}$ before observing the data is represented by a prior density
$f(\boldsymbol{\theta})$.¹ Probability statements that the analyst might have made about
$\boldsymbol{\theta}$ before observing the data can be expressed as integrals of $f(\boldsymbol{\theta})$;
for example, the previous statement would be expressed as
$\int_0^{\infty} f(\theta_1)\, d\theta_1 = 0.05$, where
$f(\theta_1) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(\boldsymbol{\theta})\, d\theta_2\, d\theta_3 \cdots d\theta_a$.
The sample likelihood [12.1.1] is viewed as the density of $\mathbf{y}$ conditional on the value
of the random variable $\boldsymbol{\theta}$, denoted $f(\mathbf{y}|\boldsymbol{\theta})$. The product
of the prior density and the sample likelihood gives the joint density of $\mathbf{y}$ and
$\boldsymbol{\theta}$:

$$f(\mathbf{y}, \boldsymbol{\theta}) = f(\mathbf{y}|\boldsymbol{\theta}) \cdot f(\boldsymbol{\theta}). \qquad \text{[12.1.2]}$$


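Probability statements that would be made about $\boldsymbol{\theta}$ after the data $\mathbf{y}$ have
been observed are based on the posterior density of $\boldsymbol{\theta}$, which is given by

$$f(\boldsymbol{\theta}|\mathbf{y}) = \frac{f(\mathbf{y}, \boldsymbol{\theta})}{f(\mathbf{y})}. \qquad \text{[12.1.3]}$$

Recalling [12.1.2] and the fact that
$f(\mathbf{y}) = \int_{-\infty}^{\infty} f(\mathbf{y}, \boldsymbol{\theta})\, d\boldsymbol{\theta}$,
equation [12.1.3] can be written as

$$f(\boldsymbol{\theta}|\mathbf{y}) = \frac{f(\mathbf{y}|\boldsymbol{\theta}) \cdot f(\boldsymbol{\theta})}{\int_{-\infty}^{\infty} f(\mathbf{y}|\boldsymbol{\theta}) \cdot f(\boldsymbol{\theta})\, d\boldsymbol{\theta}}, \qquad \text{[12.1.4]}$$

which is known as Bayes’s law. In practice, the posterior density can sometimes
be found simply by rearranging the elements in [12.1.2] as

$$f(\mathbf{y}, \boldsymbol{\theta}) = f(\boldsymbol{\theta}|\mathbf{y}) \cdot f(\mathbf{y}),$$

where $f(\mathbf{y})$ is a density that does not involve $\boldsymbol{\theta}$; the other factor,
$f(\boldsymbol{\theta}|\mathbf{y})$, is then the posterior density.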
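As a rough numerical illustration of Bayes's law [12.1.4], the sketch below evaluates a prior and a likelihood on a grid of parameter values and normalizes their product. It is not part of the original text: the function name `grid_posterior`, the simulated data, and the N(0, 4) prior are all assumptions chosen only for illustration, and the integral in the denominator is approximated by a simple Riemann sum.

```python
import numpy as np

def grid_posterior(theta_grid, prior_pdf, likelihood):
    """Approximate f(theta|y) on an evenly spaced grid (hypothetical helper)."""
    step = theta_grid[1] - theta_grid[0]
    joint = likelihood * prior_pdf      # f(y|theta) * f(theta), the joint density
    marginal = joint.sum() * step       # Riemann-sum approximation to f(y)
    return joint / marginal             # posterior density on the grid

# Assumed example: y_t ~ N(mu, 1) with a N(0, 4) prior on the scalar mu.
rng = np.random.default_rng(0)
y = rng.normal(loc=1.5, scale=1.0, size=20)
mu_grid = np.linspace(-10.0, 10.0, 4001)
prior = np.exp(-mu_grid**2 / 8.0) / np.sqrt(8.0 * np.pi)

# The likelihood is computed up to a constant factor; the constant cancels
# when the product is normalized, so the posterior is unaffected.
log_lik = -0.5 * ((y[:, None] - mu_grid[None, :]) ** 2).sum(axis=0)
likelihood = np.exp(log_lik - log_lik.max())

posterior = grid_posterior(mu_grid, prior, likelihood)
step = mu_grid[1] - mu_grid[0]
posterior_mean = (mu_grid * posterior).sum() * step
```

A grid is practical only for one or two parameters; it is used here because it mirrors the formula directly.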


Estimating the Mean of a Gaussian Distribution 
with Known Variance 


To illustrate the Bayesian approach, let $y_t \sim \text{i.i.d. } N(\mu, \sigma^2)$ as before and
write the sample likelihood [12.1.1] as

$$f(\mathbf{y}|\mu; \sigma^2) = \frac{1}{(2\pi\sigma^2)^{T/2}} \exp\left\{\left[-\frac{1}{2\sigma^2}\right](\mathbf{y} - \mu \cdot \mathbf{1})'(\mathbf{y} - \mu \cdot \mathbf{1})\right\}, \qquad \text{[12.1.5]}$$

where $\mathbf{1}$ denotes a $(T \times 1)$ vector of 1s. Here $\mu$ is regarded as a random variable.
To keep the example simple, we will assume that the variance $\sigma^2$ is known with
certainty. Suppose that prior information about $\mu$ is represented by the prior
distribution $\mu \sim N(m, \sigma^2/\nu)$:

$$f(\mu; \sigma^2) = \frac{1}{(2\pi\sigma^2/\nu)^{1/2}} \exp\left[\frac{-(\mu - m)^2}{2\sigma^2/\nu}\right]. \qquad \text{[12.1.6]}$$

Here $m$ and $\nu$ are parameters that describe the nature and quality of prior
information about $\mu$. The parameter $m$ can be interpreted as the estimate of $\mu$ the
analyst would have made before observing $\mathbf{y}$, with $\sigma^2/\nu$ the MSE of this estimate.
Expressing this MSE as a multiple $(1/\nu)$ of the variance of the distribution for $y_t$
turns out to simplify some of the expressions that follow. Greater confidence in
the prior information would be represented by larger values of $\nu$.


¹Throughout this chapter we will omit the subscript that indicates the random variable whose density
is being described; for example, $f_{\boldsymbol{\theta}}(\boldsymbol{\theta})$ will simply be denoted
$f(\boldsymbol{\theta})$. The random variable whose density is being described should always be clear
from the context and the argument of $f(\cdot)$.


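To make the idea of a prior distribution more concrete, suppose that before
observing $\mathbf{y}$ the analyst had earlier obtained a sample of $N$ separate observations
$\{z_i, i = 1, 2, \ldots, N\}$ from the $N(\mu, \sigma^2)$ distribution. It would then be natural to
take $m$ to be the mean of this earlier sample ($m = \bar{z} = (1/N)\sum_{i=1}^{N} z_i$) and
$\sigma^2/\nu$ to be the variance of $\bar{z}$, that is, to take $\nu = N$. The larger this earlier
sample ($N$), the greater the confidence in the prior information.

The posterior distribution for $\mu$ after observing the sample $\mathbf{y}$ is described by
the following proposition.


Proposition 12.1: The product of [12.1.5] and [12.1.6] can be written in the form
$f(\mu|\mathbf{y}; \sigma^2) \cdot f(\mathbf{y}; \sigma^2)$, where

$$f(\mu|\mathbf{y}; \sigma^2) = \frac{1}{[2\pi\sigma^2/(\nu + T)]^{1/2}} \exp\left[\frac{-(\mu - m^*)^2}{2\sigma^2/(\nu + T)}\right] \qquad \text{[12.1.7]}$$

$$f(\mathbf{y}; \sigma^2) = \frac{1}{(2\pi\sigma^2)^{T/2}}\, |\mathbf{I}_T + \mathbf{1}\cdot\mathbf{1}'/\nu|^{-1/2} \exp\left\{\left[-\frac{1}{2\sigma^2}\right](\mathbf{y} - m\cdot\mathbf{1})'(\mathbf{I}_T + \mathbf{1}\cdot\mathbf{1}'/\nu)^{-1}(\mathbf{y} - m\cdot\mathbf{1})\right\} \qquad \text{[12.1.8]}$$

$$m^* = \left(\frac{\nu}{\nu + T}\right) m + \left(\frac{T}{\nu + T}\right) \bar{y} \qquad \text{[12.1.9]}$$

$$\bar{y} = (1/T) \sum_{t=1}^{T} y_t.$$

In other words, the distribution of $\mu$ conditional on the data $(y_1, y_2, \ldots, y_T)$
is $N(m^*, \sigma^2/(\nu + T))$, while the marginal distribution of $\mathbf{y}$ is
$N(m\cdot\mathbf{1}, \sigma^2(\mathbf{I}_T + \mathbf{1}\cdot\mathbf{1}'/\nu))$.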


With a quadratic loss function, the Bayesian estimate of $\mu$ is the value $\hat{\mu}$ that
minimizes $E(\mu - \hat{\mu})^2$. Although this is the same expression as the classical MSE,
its interpretation is different. From the Bayesian perspective, $\mu$ is a random variable
with respect to whose distribution the expectation is taken, and $\hat{\mu}$ is a candidate
value for the estimate. The optimal value for $\hat{\mu}$ is the mean of the posterior
distribution described in Proposition 12.1:

$$\hat{\mu} = \left(\frac{\nu}{\nu + T}\right) m + \left(\frac{T}{\nu + T}\right) \bar{y}.$$

This is a weighted average of the estimate the classical statistician would use ($\bar{y}$)
and an estimate based on prior information alone ($m$). Larger values of $\nu$ correspond
to greater confidence in prior information, and this would make the Bayesian
estimate closer to $m$. On the other hand, as $\nu$ approaches zero, the Bayesian
estimate approaches the classical estimate $\bar{y}$. The limit of [12.1.6] as $\nu \to 0$ is known
as a diffuse or improper prior density. In this case, the quality of prior information
is so poor that prior information is completely disregarded in forming the
estimate $\hat{\mu}$.

The uncertainty associated with the posterior estimate $\hat{\mu}$ is described by the
variance of the posterior distribution. To use the data to evaluate the plausibility of
the claim that $\mu_0 < \mu < \mu_1$, we simply calculate the probability
$\int_{\mu_0}^{\mu_1} f(\mu|\mathbf{y}; \sigma^2)\, d\mu$.
For example, the Bayesian would assert that the probability that $\mu$ is within the
range $\hat{\mu} \pm 2\sigma/\sqrt{\nu + T}$ is 0.95.
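The weighted-average formula for the posterior mean, and the posterior variance stated in Proposition 12.1, can be sketched in a few lines. The function name and the sample values below are assumptions made for illustration only.

```python
import numpy as np

def posterior_mean_known_variance(y, m, nu, sigma2):
    """Posterior mean m* and variance of mu for a N(m, sigma2/nu) prior
    and an i.i.d. N(mu, sigma2) sample (hypothetical helper)."""
    T = len(y)
    ybar = float(np.mean(y))
    m_star = (nu / (nu + T)) * m + (T / (nu + T)) * ybar  # weighted average of m and ybar
    post_var = sigma2 / (nu + T)                          # variance of mu given y
    return m_star, post_var

# Assumed example: prior mean 0 carrying the weight of nu = 4 observations,
# combined with T = 4 observations whose sample mean is 2.5.
y = np.array([1.0, 2.0, 3.0, 4.0])
m_star, post_var = posterior_mean_known_variance(y, m=0.0, nu=4.0, sigma2=1.0)

# A very small nu approximates the diffuse prior, so the estimate is close to ybar.
m_diffuse, _ = posterior_mean_known_variance(y, m=0.0, nu=1e-12, sigma2=1.0)
```

With equal weights on the prior and the sample, the posterior mean sits halfway between the prior mean and the sample mean.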




Estimating the Coefficients of a Regression Model 
with Known Variance 


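Next, consider the linear regression model

$$y_t = \mathbf{x}_t'\boldsymbol{\beta} + u_t,$$

where $u_t \sim \text{i.i.d. } N(0, \sigma^2)$, $\mathbf{x}_t$ is a $(k \times 1)$ vector of exogenous
explanatory variables, and $\boldsymbol{\beta}$ is a $(k \times 1)$ vector of coefficients. Let

$$\underset{(T \times 1)}{\mathbf{y}} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_T \end{bmatrix} \qquad
\underset{(T \times k)}{\mathbf{X}} = \begin{bmatrix} \mathbf{x}_1' \\ \mathbf{x}_2' \\ \vdots \\ \mathbf{x}_T' \end{bmatrix}.$$

Treating $\boldsymbol{\beta}$ as random but $\sigma^2$ as known, we have the likelihood

$$\begin{aligned}
f(\mathbf{y}|\boldsymbol{\beta}, \mathbf{X}; \sigma^2)
&= \prod_{t=1}^{T} \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{\left[-\frac{1}{2\sigma^2}\right](y_t - \mathbf{x}_t'\boldsymbol{\beta})^2\right\} \\
&= \frac{1}{(2\pi\sigma^2)^{T/2}} \exp\left\{\left[-\frac{1}{2\sigma^2}\right](\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\right\}.
\end{aligned} \qquad \text{[12.1.10]}$$

Suppose that prior information about $\boldsymbol{\beta}$ is represented by a
$N(\mathbf{m}, \sigma^2\mathbf{M})$ distribution:

$$f(\boldsymbol{\beta}; \sigma^2) = \frac{1}{(2\pi\sigma^2)^{k/2}}\, |\mathbf{M}|^{-1/2} \exp\left\{\left[-\frac{1}{2\sigma^2}\right](\boldsymbol{\beta} - \mathbf{m})'\mathbf{M}^{-1}(\boldsymbol{\beta} - \mathbf{m})\right\}. \qquad \text{[12.1.11]}$$

Thus, prior to observation of the sample, the analyst’s best guess as to the value
of $\boldsymbol{\beta}$ is represented by the $(k \times 1)$ vector $\mathbf{m}$, and the confidence in
this guess is summarized by the $(k \times k)$ matrix $\sigma^2\mathbf{M}$; less confidence is
represented by larger diagonal elements of $\mathbf{M}$. Knowledge about the exogenous variables
$\mathbf{X}$ is presumed to have no effect on the prior distribution, so that [12.1.11] also describes
$f(\boldsymbol{\beta}|\mathbf{X}; \sigma^2)$.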


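Proposition 12.1 generalizes as follows.


Proposition 12.2: The product of [12.1.10] and [12.1.11] can be written in the form
$f(\boldsymbol{\beta}|\mathbf{y}, \mathbf{X}; \sigma^2) \cdot f(\mathbf{y}|\mathbf{X}; \sigma^2)$, where

$$f(\boldsymbol{\beta}|\mathbf{y}, \mathbf{X}; \sigma^2) = \frac{1}{(2\pi\sigma^2)^{k/2}}\, |\mathbf{M}^{-1} + \mathbf{X}'\mathbf{X}|^{1/2} \exp\left\{\left[-\frac{1}{2\sigma^2}\right](\boldsymbol{\beta} - \mathbf{m}^*)'(\mathbf{M}^{-1} + \mathbf{X}'\mathbf{X})(\boldsymbol{\beta} - \mathbf{m}^*)\right\} \qquad \text{[12.1.12]}$$

$$f(\mathbf{y}|\mathbf{X}; \sigma^2) = \frac{1}{(2\pi\sigma^2)^{T/2}}\, |\mathbf{I}_T + \mathbf{X}\mathbf{M}\mathbf{X}'|^{-1/2} \exp\left\{\left[-\frac{1}{2\sigma^2}\right](\mathbf{y} - \mathbf{X}\mathbf{m})'(\mathbf{I}_T + \mathbf{X}\mathbf{M}\mathbf{X}')^{-1}(\mathbf{y} - \mathbf{X}\mathbf{m})\right\} \qquad \text{[12.1.13]}$$

$$\mathbf{m}^* = (\mathbf{M}^{-1} + \mathbf{X}'\mathbf{X})^{-1}(\mathbf{M}^{-1}\mathbf{m} + \mathbf{X}'\mathbf{y}). \qquad \text{[12.1.14]}$$

In other words, the distribution of $\boldsymbol{\beta}$ conditional on the observed data is
$N(\mathbf{m}^*, \sigma^2(\mathbf{M}^{-1} + \mathbf{X}'\mathbf{X})^{-1})$ and the marginal distribution of
$\mathbf{y}$ given $\mathbf{X}$ is $N(\mathbf{X}\mathbf{m}, \sigma^2(\mathbf{I}_T + \mathbf{X}\mathbf{M}\mathbf{X}'))$.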


Poor prior information about $\boldsymbol{\beta}$ corresponds to a large variance $\mathbf{M}$, or
equivalently a small value for $\mathbf{M}^{-1}$. The diffuse prior distribution for this problem is
often represented by the limit as $\mathbf{M}^{-1} \to \mathbf{0}$, for which the posterior mean [12.1.14]
becomes $\mathbf{m}^* = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$, the OLS estimator. The variance
of the posterior distribution becomes $\sigma^2(\mathbf{X}'\mathbf{X})^{-1}$. Thus, classical regression
inference is reproduced as a special case of Bayesian inference with a diffuse prior distribution. At the
other extreme, if $\mathbf{X}'\mathbf{X} = \mathbf{0}$, the sample contains no information about
$\boldsymbol{\beta}$ and the posterior distribution is $N(\mathbf{m}, \sigma^2\mathbf{M})$, the same as
the prior distribution.

If the analyst’s prior expectation is that all coefficients are zero ($\mathbf{m} = \mathbf{0}$) and
this claim is made with the same confidence for each coefficient
($\mathbf{M}^{-1} = \lambda \cdot \mathbf{I}_k$ for some $\lambda > 0$), then the Bayesian estimator
[12.1.14] is

$$\mathbf{m}^* = (\lambda \cdot \mathbf{I}_k + \mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}, \qquad \text{[12.1.15]}$$

which is the ridge regression estimator proposed by Hoerl and Kennard (1970).
The effect of ridge regression is to shrink the parameter estimates toward zero.
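The posterior mean [12.1.14] and its two special cases (the diffuse prior and the ridge prior) can be checked numerically. The sketch below is illustrative only: the function name, the simulated data, and the value chosen for the ridge parameter are assumptions, and the prior is passed in through its inverse so that the diffuse case can be represented by a zero matrix.

```python
import numpy as np

def posterior_beta_known_variance(X, y, m, M_inv, sigma2):
    """Posterior mean and covariance of beta for a N(m, sigma2*M) prior,
    parameterized by M_inv = inverse of M (hypothetical helper)."""
    precision = M_inv + X.T @ X
    m_star = np.linalg.solve(precision, M_inv @ m + X.T @ y)  # posterior mean
    post_cov = sigma2 * np.linalg.inv(precision)              # posterior covariance
    return m_star, post_cov

# Assumed simulated data for illustration.
rng = np.random.default_rng(0)
T, k = 50, 3
X = rng.normal(size=(T, k))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=T)

# Diffuse prior: M_inv = 0 reproduces OLS.
ols, _ = posterior_beta_known_variance(X, y, np.zeros(k), np.zeros((k, k)), 1.0)

# Ridge prior: m = 0 and M_inv = lam * I_k shrinks the estimates toward zero.
lam = 10.0
ridge, _ = posterior_beta_known_variance(X, y, np.zeros(k), lam * np.eye(k), 1.0)
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse explicitly for the mean; the inverse is still needed for the covariance matrix.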


Bayesian Estimation of a Regression Model 
with Unknown Variance 


Propositions 12.1 and 12.2 assumed that the residual variance $\sigma^2$ was known
with certainty. Usually, both $\sigma^2$ and $\boldsymbol{\beta}$ would be regarded as random variables,
and Bayesian analysis requires a prior distribution for $\sigma^2$. A convenient prior
distribution for this application is provided by the gamma distribution. Let
$\{Z_i\}_{i=1}^{N}$ be a sequence of i.i.d. $N(0, \tau^2)$ variables. Then
$W = \sum_{i=1}^{N} Z_i^2$ is said to have a gamma distribution with $N$ degrees of freedom and
scale parameter $\lambda$, indicated $W \sim \Gamma(N, \lambda)$, where $\lambda = 1/\tau^2$. Thus,
$W$ has the distribution of $\tau^2$ times a $\chi^2(N)$ variable. The mean of $W$ is given by

$$E(W) = N \cdot E(Z_i^2) = N\tau^2 = N/\lambda, \qquad \text{[12.1.16]}$$

and the variance is

$$\begin{aligned}
E(W^2) - [E(W)]^2 &= N \cdot \{E(Z_i^4) - [E(Z_i^2)]^2\} \\
&= N \cdot (3\tau^4 - \tau^4) = 2N\tau^4 = 2N/\lambda^2.
\end{aligned} \qquad \text{[12.1.17]}$$

The density of $W$ takes the form

$$f(w) = \frac{(\lambda/2)^{N/2}\, w^{[(N/2)-1]} \exp[-\lambda w/2]}{\Gamma(N/2)}, \qquad \text{[12.1.18]}$$

where $\Gamma(\cdot)$ denotes the gamma function. If $N$ is an even integer, then

$$\Gamma(N/2) = 1 \cdot 2 \cdot 3 \cdots [(N/2) - 1],$$

with $\Gamma(2/2) = 1$; whereas if $N$ is an odd integer, then

$$\Gamma(N/2) = \sqrt{\pi} \cdot \tfrac{1}{2} \cdot \tfrac{3}{2} \cdot \tfrac{5}{2} \cdots [(N/2) - 1],$$

with $\Gamma(\tfrac{1}{2}) = \sqrt{\pi}$.

Following DeGroot (1970) and Leamer (1978), it is convenient to describe
the prior distribution not in terms of the variance $\sigma^2$ but rather in terms of the
reciprocal of the variance, $\sigma^{-2}$, which is known as the precision. Thus, suppose
that the prior distribution is specified as $\sigma^{-2} \sim \Gamma(N, \lambda)$, where $N$ and
$\lambda$ are parameters that describe the analyst’s prior information:

$$f(\sigma^{-2}|\mathbf{X}) = \frac{(\lambda/2)^{N/2}\, \sigma^{-2[(N/2)-1]} \exp[-\lambda\sigma^{-2}/2]}{\Gamma(N/2)}. \qquad \text{[12.1.19]}$$
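The moments in [12.1.16] and [12.1.17] can be checked by simulating the construction of $W$ as a sum of squared normal variables. This is a minimal Monte Carlo sketch; the function name, the seed, and the parameter values are assumptions chosen for illustration, and the simulated moments agree with the formulas only up to sampling error.

```python
import numpy as np

def simulate_gamma(N, lam, size, rng):
    """Draw W = sum of N squared N(0, tau^2) variables with tau^2 = 1/lam,
    i.e. W ~ Gamma(N, lam) in the text's parameterization (hypothetical helper)."""
    tau = np.sqrt(1.0 / lam)
    Z = rng.normal(0.0, tau, size=(size, N))
    return (Z ** 2).sum(axis=1)

# Assumed values: N = 5 degrees of freedom and scale parameter lam = 2.
rng = np.random.default_rng(0)
N, lam = 5, 2.0
W = simulate_gamma(N, lam, size=200_000, rng=rng)

sim_mean, sim_var = W.mean(), W.var()
theory_mean = N / lam            # N / lambda
theory_var = 2 * N / lam ** 2    # 2N / lambda^2
```

Note that this $\Gamma(N, \lambda)$ notation differs from the shape and rate convention used by most software libraries; in that convention the same distribution has shape $N/2$ and rate $\lambda/2$.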




Recalling [12.1.16], the ratio $N/\lambda$ is the value expected for $\sigma^{-2}$ on the basis of prior
information. As we will see shortly in Proposition 12.3, if the prior information is
based on an earlier sample of observations $\{z_1, z_2, \ldots, z_N\}$, the parameter $N$
turns out to describe the size of this earlier sample and $\lambda$ is the earlier sample's
sum of squared residuals. For a given ratio of $N/\lambda$, larger values for $N$ imply greater
confidence in the prior information.

The prior distribution of $\boldsymbol{\beta}$ conditional on the value for $\sigma^{-2}$ is the same as in
[12.1.11]:

$$f(\boldsymbol{\beta}|\sigma^{-2}, \mathbf{X}) = \frac{1}{(2\pi\sigma^2)^{k/2}}\, |\mathbf{M}|^{-1/2} \exp\left\{\left[-\frac{1}{2\sigma^2}\right](\boldsymbol{\beta} - \mathbf{m})'\mathbf{M}^{-1}(\boldsymbol{\beta} - \mathbf{m})\right\}. \qquad \text{[12.1.20]}$$

Thus, $f(\boldsymbol{\beta}, \sigma^{-2}|\mathbf{X})$, the joint prior density for $\boldsymbol{\beta}$ and
$\sigma^{-2}$, is given by the product of [12.1.19] and [12.1.20]. The posterior distribution
$f(\boldsymbol{\beta}, \sigma^{-2}|\mathbf{y}, \mathbf{X})$ is described by the following proposition.


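Proposition 12.3: Let the prior density $f(\boldsymbol{\beta}, \sigma^{-2}|\mathbf{X})$ be given by the
product of [12.1.19] and [12.1.20], and let the sample likelihood be

$$f(\mathbf{y}|\boldsymbol{\beta}, \sigma^{-2}, \mathbf{X}) = \frac{1}{(2\pi\sigma^2)^{T/2}} \exp\left\{\left[-\frac{1}{2\sigma^2}\right](\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\right\}. \qquad \text{[12.1.21]}$$

Then the following hold:

(a) The joint posterior density of $\boldsymbol{\beta}$ and $\sigma^{-2}$ is given by

$$f(\boldsymbol{\beta}, \sigma^{-2}|\mathbf{y}, \mathbf{X}) = f(\boldsymbol{\beta}|\sigma^{-2}, \mathbf{y}, \mathbf{X}) \cdot f(\sigma^{-2}|\mathbf{y}, \mathbf{X}), \qquad \text{[12.1.22]}$$

where the posterior distribution of $\boldsymbol{\beta}$ conditional on $\sigma^{-2}$ is
$N(\mathbf{m}^*, \sigma^2\mathbf{M}^*)$:

$$f(\boldsymbol{\beta}|\sigma^{-2}, \mathbf{y}, \mathbf{X}) = \frac{1}{(2\pi\sigma^2)^{k/2}}\, |\mathbf{M}^*|^{-1/2} \exp\left\{\left[-\frac{1}{2\sigma^2}\right](\boldsymbol{\beta} - \mathbf{m}^*)'(\mathbf{M}^*)^{-1}(\boldsymbol{\beta} - \mathbf{m}^*)\right\}, \qquad \text{[12.1.23]}$$

with

$$\mathbf{m}^* = (\mathbf{M}^{-1} + \mathbf{X}'\mathbf{X})^{-1}(\mathbf{M}^{-1}\mathbf{m} + \mathbf{X}'\mathbf{y}) \qquad \text{[12.1.24]}$$

$$\mathbf{M}^* = (\mathbf{M}^{-1} + \mathbf{X}'\mathbf{X})^{-1}. \qquad \text{[12.1.25]}$$

Furthermore, the marginal posterior distribution of $\sigma^{-2}$ is $\Gamma(N^*, \lambda^*)$:

$$f(\sigma^{-2}|\mathbf{y}, \mathbf{X}) = \frac{\sigma^{-2[(N^*/2)-1]}\,(\lambda^*/2)^{N^*/2}}{\Gamma(N^*/2)} \exp[-\lambda^*\sigma^{-2}/2], \qquad \text{[12.1.26]}$$

with

$$N^* = N + T \qquad \text{[12.1.27]}$$

$$\lambda^* = \lambda + (\mathbf{y} - \mathbf{X}\mathbf{b})'(\mathbf{y} - \mathbf{X}\mathbf{b}) + (\mathbf{b} - \mathbf{m})'\mathbf{M}^{-1}(\mathbf{X}'\mathbf{X} + \mathbf{M}^{-1})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{b} - \mathbf{m}) \qquad \text{[12.1.28]}$$

for $\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ the OLS estimator.

(b) The marginal posterior distribution for $\boldsymbol{\beta}$ is a k-dimensional t distribution with
$N^*$ degrees of freedom, mean $\mathbf{m}^*$, and scale matrix $(\lambda^*/N^*) \cdot \mathbf{M}^*$:

$$f(\boldsymbol{\beta}|\mathbf{y}, \mathbf{X}) = \left\{\frac{\Gamma[(k + N^*)/2]}{(\pi N^*)^{k/2}\,\Gamma(N^*/2)}\, |(\lambda^*/N^*)\mathbf{M}^*|^{-1/2} \times \left[1 + (1/N^*)(\boldsymbol{\beta} - \mathbf{m}^*)'[(\lambda^*/N^*)\mathbf{M}^*]^{-1}(\boldsymbol{\beta} - \mathbf{m}^*)\right]^{-(k+N^*)/2}\right\}. \qquad \text{[12.1.29]}$$

(c) Let $\mathbf{R}$ be a known $(m \times k)$ matrix with linearly independent rows, and define

$$Q = \frac{[\mathbf{R}(\boldsymbol{\beta} - \mathbf{m}^*)]'[\mathbf{R}(\mathbf{M}^{-1} + \mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1}[\mathbf{R}(\boldsymbol{\beta} - \mathbf{m}^*)]/m}{\lambda^*/N^*}. \qquad \text{[12.1.30]}$$

Then $Q$ has a marginal posterior distribution that is $F(m, N^*)$:

$$f(q|\mathbf{y}, \mathbf{X}) = \frac{m^{m/2}\,(N^*)^{N^*/2}\,\Gamma[(N^* + m)/2]\, q^{[(m/2)-1]}}{\Gamma(m/2)\,\Gamma(N^*/2)\,(N^* + mq)^{[(N^*+m)/2]}}. \qquad \text{[12.1.31]}$$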


Recalling [12.1.16], result (a) implies that the Bayesian estimate of the precision is

$$E(\sigma^{-2}|\mathbf{y}, \mathbf{X}) = N^*/\lambda^*. \qquad \text{[12.1.32]}$$

Diffuse prior information is sometimes represented as $N = \lambda = 0$ and
$\mathbf{M}^{-1} = \mathbf{0}$. Substituting these values into [12.1.27] and [12.1.28] implies that
$N^* = T$ and $\lambda^* = (\mathbf{y} - \mathbf{X}\mathbf{b})'(\mathbf{y} - \mathbf{X}\mathbf{b})$. For these
values, the posterior mean [12.1.32] would be

$$E(\sigma^{-2}|\mathbf{y}, \mathbf{X}) = T/(\mathbf{y} - \mathbf{X}\mathbf{b})'(\mathbf{y} - \mathbf{X}\mathbf{b}),$$

which is the maximum likelihood estimate of $\sigma^{-2}$. This is the basis for the earlier
claim that the parameter $N$ for the prior distribution might be viewed as the number
of presample observations on which the prior information is based and that $\lambda$ might
be viewed as the sum of squared residuals for these observations.

Result (b) implies that the Bayesian estimate of the coefficient vector is

$$E(\boldsymbol{\beta}|\mathbf{y}, \mathbf{X}) = \mathbf{m}^* = (\mathbf{M}^{-1} + \mathbf{X}'\mathbf{X})^{-1}(\mathbf{M}^{-1}\mathbf{m} + \mathbf{X}'\mathbf{y}), \qquad \text{[12.1.33]}$$

which is identical to the estimate derived in Proposition 12.2 for the case where
$\sigma^2$ is known. Again, for diffuse prior information, $\mathbf{m}^* = \mathbf{b}$, the OLS estimate.
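The posterior parameters in results (a) and (b) of Proposition 12.3 follow directly from [12.1.24], [12.1.25], [12.1.27], and [12.1.28]. The sketch below computes them and confirms the diffuse-prior special case described above. The function name and the simulated data are assumptions for illustration; the prior on the coefficients enters through the inverse of M so that a diffuse prior can be passed as a zero matrix, and X'X is assumed to be invertible.

```python
import numpy as np

def normal_gamma_posterior(X, y, m, M_inv, N, lam):
    """Return (m*, M*, N*, lambda*) for the normal-gamma prior of Proposition 12.3
    (hypothetical helper; M_inv is the inverse of the prior matrix M)."""
    T = X.shape[0]
    XtX = X.T @ X
    b = np.linalg.solve(XtX, X.T @ y)          # OLS estimator
    precision = M_inv + XtX
    m_star = np.linalg.solve(precision, M_inv @ m + X.T @ y)
    M_star = np.linalg.inv(precision)
    resid = y - X @ b
    d = b - m
    # lambda* = lambda + SSR + (b - m)' M^{-1} (X'X + M^{-1})^{-1} X'X (b - m)
    lam_star = lam + resid @ resid + d @ M_inv @ np.linalg.solve(precision, XtX @ d)
    N_star = N + T
    return m_star, M_star, N_star, lam_star

# Assumed simulated data.
rng = np.random.default_rng(1)
T, k = 40, 2
X = np.column_stack([np.ones(T), rng.normal(size=T)])
y = X @ np.array([0.5, 2.0]) + rng.normal(size=T)

# Diffuse prior: N = lambda = 0 and M_inv = 0.
m_star, M_star, N_star, lam_star = normal_gamma_posterior(
    X, y, m=np.zeros(k), M_inv=np.zeros((k, k)), N=0, lam=0.0
)
precision_estimate = N_star / lam_star   # posterior mean of sigma^{-2}
```

Under the diffuse prior the posterior mean of the precision equals T divided by the OLS sum of squared residuals, which is the maximum likelihood estimate.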

Result (c) describes the Bayesian perspective on a hypothesis about the value
of $\mathbf{R}\boldsymbol{\beta}$, where the matrix $\mathbf{R}$ characterizes which linear combinations of
the elements of $\boldsymbol{\beta}$ are of interest. A classical statistician would test the hypothesis
that $\mathbf{R}\boldsymbol{\beta} = \mathbf{r}$ by calculating an OLS F statistic,

$$\frac{(\mathbf{R}\mathbf{b} - \mathbf{r})'[\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1}(\mathbf{R}\mathbf{b} - \mathbf{r})/m}{s^2},$$

and evaluating the probability that an $F(m, T - k)$ variable could equal or exceed
this magnitude. This represents the probability that the estimated value of $\mathbf{R}\mathbf{b}$ could
be as far as it is observed to be from $\mathbf{r}$ given that the true value of $\boldsymbol{\beta}$ satisfies
$\mathbf{R}\boldsymbol{\beta} = \mathbf{r}$. By contrast, a Bayesian regards $\mathbf{R}\boldsymbol{\beta}$ as a
random variable, the distribution for which is described in result (c). According to [12.1.30], the
probability that $\mathbf{R}\boldsymbol{\beta}$ would equal $\mathbf{r}$ is related to the probability that
an $F(m, N^*)$ variable would assume the value

$$\frac{(\mathbf{r} - \mathbf{R}\mathbf{m}^*)'[\mathbf{R}(\mathbf{M}^{-1} + \mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1}(\mathbf{r} - \mathbf{R}\mathbf{m}^*)/m}{\lambda^*/N^*}.$$

The probability that an $F(m, N^*)$ variable could exceed this magnitude represents
the probability that the random variable $\mathbf{R}\boldsymbol{\beta}$ might be as far from the posterior
mean $\mathbf{R}\mathbf{m}^*$ as is represented by the point $\mathbf{R}\boldsymbol{\beta} = \mathbf{r}$. In the
case of a diffuse prior distribution, the preceding expression simplifies to

$$\frac{(\mathbf{r} - \mathbf{R}\mathbf{b})'[\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1}(\mathbf{r} - \mathbf{R}\mathbf{b})/m}{(\mathbf{y} - \mathbf{X}\mathbf{b})'(\mathbf{y} - \mathbf{X}\mathbf{b})/T},$$

which is to be compared in this case with an $F(m, T)$ distribution. Recalling that

$$s^2 = (\mathbf{y} - \mathbf{X}\mathbf{b})'(\mathbf{y} - \mathbf{X}\mathbf{b})/(T - k),$$

it appears that, apart from a minor difference in the denominator degrees of
freedom, the classical statistician and the Bayesian with a diffuse prior distribution
would essentially be calculating the identical test statistic and comparing it with
the same critical value in evaluating the plausibility of the hypothesis represented
by $\mathbf{R}\boldsymbol{\beta} = \mathbf{r}$.


Bayesian Analysis of Regressions with Lagged 
Dependent Variables 


In describing the sample likelihood (expression [12.1.10] or [12.1.21]), the
assumption was made that the vector of explanatory variables $\mathbf{x}_t$ was strictly
exogenous. If $\mathbf{x}_t$ contains lagged values of $y$, then as long as we are willing to treat
presample values of $y$ as deterministic, the algebra goes through exactly the same.
The only changes needed are some slight adjustments in notation and in the
description of the results. For example, consider a pth-order autoregression with
$\mathbf{x}_t = (1, y_{t-1}, y_{t-2}, \ldots, y_{t-p})'$. In this case, the expression on the right side of
[12.1.21] describes the likelihood of $(y_1, y_2, \ldots, y_T)$ conditional on
$y_0, y_{-1}, \ldots, y_{-p+1}$; that is, it describes $f(\mathbf{y}|\boldsymbol{\beta}, \sigma^{-2}, \mathbf{x}_1)$.
The prior distributions [12.1.19] and [12.1.20] are then presumed to describe
$f(\sigma^{-2}|\mathbf{x}_1)$ and $f(\boldsymbol{\beta}|\sigma^{-2}, \mathbf{x}_1)$, and the posterior
distributions are all as stated in Proposition 12.3.

Note in particular that results (b) and (c) of Proposition 12.3 describe the
exact small-sample posterior distributions, even when $\mathbf{x}_t$ contains lagged dependent
variables. By contrast, a classical statistician would consider the usual t and F tests
to be valid only asymptotically.
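To apply the regression formulas to an autoregression, the lagged values simply need to be arranged as the rows of the regressor matrix, with the first p observations treated as the fixed presample values. The helper below is a sketch of that bookkeeping; its name and the toy series are assumptions for illustration.

```python
import numpy as np

def ar_design(y, p):
    """Build (y_vec, X) for a pth-order autoregression, where row t of X is
    (1, y_{t-1}, ..., y_{t-p}) and the first p values of y are treated as
    presample observations (hypothetical helper)."""
    y = np.asarray(y, dtype=float)
    T = y.size - p                      # number of usable observations
    lags = [y[p - s : p - s + T] for s in range(1, p + 1)]
    X = np.column_stack([np.ones(T)] + lags)
    return y[p:], X

# Toy series 0, 1, ..., 5 with p = 2: the first two values are presample.
y_vec, X = ar_design(np.arange(6), p=2)
```

The resulting `y_vec` and `X` can then be used wherever the regression formulas of this section call for the dependent variable and the regressor matrix.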


Calculation of the Posterior Distribution Using 
a GLS Regression 


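It is sometimes convenient to describe the prior information in terms of certain
linear combinations of coefficients, such as

$$\mathbf{R}\boldsymbol{\beta}|\sigma^{-2} \sim N(\mathbf{r}, \sigma^2\mathbf{V}). \qquad \text{[12.1.34]}$$

Here $\mathbf{R}$ denotes a known nonsingular $(k \times k)$ matrix whose rows represent linear
combinations of $\boldsymbol{\beta}$ in terms of which it is convenient to describe the analyst’s prior
information. For example, if the prior expectation is that $\beta_1 = \beta_2$, then the first
row of $\mathbf{R}$ could be $(1, -1, 0, \ldots, 0)$ and the first element of $\mathbf{r}$ would be zero.
The (1, 1) element of $\mathbf{V}$ reflects the uncertainty of this prior information. If
$\boldsymbol{\beta} \sim N(\mathbf{m}, \sigma^2\mathbf{M})$, then
$\mathbf{R}\boldsymbol{\beta} \sim N(\mathbf{R}\mathbf{m}, \sigma^2\mathbf{R}\mathbf{M}\mathbf{R}')$. Thus, the
relation between the parameters for the prior distribution as expressed in [12.1.34]
($\mathbf{R}$, $\mathbf{r}$, and $\mathbf{V}$) and the parameters for the prior distribution as expressed in
[12.1.20] ($\mathbf{m}$ and $\mathbf{M}$) is given by

$$\mathbf{r} = \mathbf{R}\mathbf{m} \qquad \text{[12.1.35]}$$

$$\mathbf{V} = \mathbf{R}\mathbf{M}\mathbf{R}'. \qquad \text{[12.1.36]}$$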


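Equation [12.1.36] implies

$$\mathbf{V}^{-1} = (\mathbf{R}')^{-1}\mathbf{M}^{-1}\mathbf{R}^{-1}. \qquad \text{[12.1.37]}$$

If equation [12.1.37] is premultiplied by $\mathbf{R}'$ and postmultiplied by $\mathbf{R}$, the result is

$$\mathbf{R}'\mathbf{V}^{-1}\mathbf{R} = \mathbf{M}^{-1}. \qquad \text{[12.1.38]}$$

Using equations [12.1.35] and [12.1.38], the posterior mean [12.1.33] can be rewritten as

$$\mathbf{m}^* = (\mathbf{R}'\mathbf{V}^{-1}\mathbf{R} + \mathbf{X}'\mathbf{X})^{-1}(\mathbf{R}'\mathbf{V}^{-1}\mathbf{r} + \mathbf{X}'\mathbf{y}). \qquad \text{[12.1.39]}$$

To obtain another perspective on [12.1.39], notice that the prior distribution
[12.1.34] can be written

$$\mathbf{r} = \mathbf{R}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad \text{[12.1.40]}$$

where $\boldsymbol{\varepsilon} \sim N(\mathbf{0}, \sigma^2\mathbf{V})$. This is of the same form as the
observation equations of the regression model,

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u} \qquad \text{[12.1.41]}$$

with $\mathbf{u} \sim N(\mathbf{0}, \sigma^2\mathbf{I}_T)$. The mixed estimation strategy described by Theil
(1971, pp. 347-49) thus regards the prior information as a set of $k$ additional observations,
with $r_i$ treated as if it were another observation on $y_t$ and the $i$th row of $\mathbf{R}$
corresponding to its vector of explanatory variables $\mathbf{x}_t'$. Specifically, equations [12.1.40]
and [12.1.41] are stacked to form the system

$$\mathbf{y}^* = \mathbf{X}^*\boldsymbol{\beta} + \mathbf{u}^*, \qquad \text{[12.1.42]}$$

where

$$\underset{[(T+k) \times 1]}{\mathbf{y}^*} = \begin{bmatrix} \mathbf{r} \\ \mathbf{y} \end{bmatrix} \qquad
\underset{[(T+k) \times k]}{\mathbf{X}^*} = \begin{bmatrix} \mathbf{R} \\ \mathbf{X} \end{bmatrix}$$

$$E(\mathbf{u}^*\mathbf{u}^{*\prime}) = \sigma^2\mathbf{V}^* = \sigma^2 \begin{bmatrix} \mathbf{V} & \mathbf{0} \\ \mathbf{0} & \mathbf{I}_T \end{bmatrix}.$$

The GLS estimator for the stacked system is

$$\begin{aligned}
\bar{\mathbf{b}} &= [\mathbf{X}^{*\prime}(\mathbf{V}^*)^{-1}\mathbf{X}^*]^{-1}[\mathbf{X}^{*\prime}(\mathbf{V}^*)^{-1}\mathbf{y}^*] \\
&= \left\{\begin{bmatrix} \mathbf{R}' & \mathbf{X}' \end{bmatrix}
\begin{bmatrix} \mathbf{V}^{-1} & \mathbf{0} \\ \mathbf{0} & \mathbf{I}_T \end{bmatrix}
\begin{bmatrix} \mathbf{R} \\ \mathbf{X} \end{bmatrix}\right\}^{-1}
\left\{\begin{bmatrix} \mathbf{R}' & \mathbf{X}' \end{bmatrix}
\begin{bmatrix} \mathbf{V}^{-1} & \mathbf{0} \\ \mathbf{0} & \mathbf{I}_T \end{bmatrix}
\begin{bmatrix} \mathbf{r} \\ \mathbf{y} \end{bmatrix}\right\} \\
&= (\mathbf{R}'\mathbf{V}^{-1}\mathbf{R} + \mathbf{X}'\mathbf{X})^{-1}(\mathbf{R}'\mathbf{V}^{-1}\mathbf{r} + \mathbf{X}'\mathbf{y}).
\end{aligned}$$

Thus the posterior mean [12.1.39] can be calculated by GLS estimation of [12.1.42].
For known $\sigma^2$, the usual formula for the variance of the GLS estimator,

$$\sigma^2[\mathbf{X}^{*\prime}(\mathbf{V}^*)^{-1}\mathbf{X}^*]^{-1} = \sigma^2(\mathbf{R}'\mathbf{V}^{-1}\mathbf{R} + \mathbf{X}'\mathbf{X})^{-1},$$

gives a correct calculation of the variance of the Bayesian posterior distribution,
$\sigma^2(\mathbf{M}^{-1} + \mathbf{X}'\mathbf{X})^{-1}$.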
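The equivalence between the posterior mean and GLS on the stacked system can be verified numerically. In the sketch below, the function name, the simulated data, and the particular choices of R, r, and V are assumptions made only for illustration; the stacked GLS estimate is compared with the posterior mean computed directly from the prior parameters implied by [12.1.35] and [12.1.38].

```python
import numpy as np

def mixed_estimate(X, y, R, r, V):
    """GLS on the stacked system y* = X* beta + u*, treating the rows of R
    as additional observations with outcomes r (hypothetical helper)."""
    T = X.shape[0]
    n_prior = R.shape[0]
    y_star = np.concatenate([r, y])
    X_star = np.vstack([R, X])
    # Inverse of V* = blockdiag(V, I_T)
    V_star_inv = np.block([
        [np.linalg.inv(V), np.zeros((n_prior, T))],
        [np.zeros((T, n_prior)), np.eye(T)],
    ])
    A = X_star.T @ V_star_inv @ X_star
    return np.linalg.solve(A, X_star.T @ V_star_inv @ y_star)

# Assumed simulated data and prior.
rng = np.random.default_rng(1)
T = 30
X = rng.normal(size=(T, 2))
y = X @ np.array([1.0, -1.0]) + rng.normal(size=T)
R = np.array([[1.0, -1.0], [1.0, 1.0]])   # nonsingular (k x k)
r = np.array([0.0, 0.5])
V = np.diag([0.1, 10.0])

gls = mixed_estimate(X, y, R, r, V)

# Direct posterior mean using M^{-1} = R'V^{-1}R and m = R^{-1} r.
M_inv = R.T @ np.linalg.inv(V) @ R
m = np.linalg.solve(R, r)
direct = np.linalg.solve(M_inv + X.T @ X, M_inv @ m + X.T @ y)
```

The practical appeal of the stacked form is that any regression routine that accepts observation weights can be used to compute the Bayesian estimate.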

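The foregoing discussion assumed that $\mathbf{R}$ was a nonsingular $(k \times k)$ matrix.
On some occasions the analyst might have valuable information about some linear
combinations of coefficients but not others. Thus, suppose that the prior distribution
[12.1.34] is written as

$$\begin{bmatrix} \mathbf{R}_1 \\ \mathbf{R}_2 \end{bmatrix} \boldsymbol{\beta}\,\Big|\,\sigma^{-2} \sim
N\left(\begin{bmatrix} \mathbf{r}_1 \\ \mathbf{r}_2 \end{bmatrix},\;
\sigma^2 \begin{bmatrix} \mathbf{V}_1 & \mathbf{0} \\ \mathbf{0} & \mathbf{V}_2 \end{bmatrix}\right),$$

where $\mathbf{R}_1$ is an $(m \times k)$ matrix consisting of those linear combinations for which
the prior information is good and $\mathbf{R}_2$ is a $[(k - m) \times k]$ matrix of the remaining
linear combinations. Then diffuse prior information about those linear combinations
described by $\mathbf{R}_2$ could be represented by the limit as $\mathbf{V}_2^{-1} \to \mathbf{0}$, for which

$$\mathbf{R}'\mathbf{V}^{-1} = \begin{bmatrix} \mathbf{R}_1' & \mathbf{R}_2' \end{bmatrix}
\begin{bmatrix} \mathbf{V}_1^{-1} & \mathbf{0} \\ \mathbf{0} & \mathbf{V}_2^{-1} \end{bmatrix}
\to \begin{bmatrix} \mathbf{R}_1'\mathbf{V}_1^{-1} & \mathbf{0} \end{bmatrix}.$$

The Bayesian estimate [12.1.39] then becomes

$$(\mathbf{R}_1'\mathbf{V}_1^{-1}\mathbf{R}_1 + \mathbf{X}'\mathbf{X})^{-1}(\mathbf{R}_1'\mathbf{V}_1^{-1}\mathbf{r}_1 + \mathbf{X}'\mathbf{y}),$$

which can be calculated from GLS estimation of a $[(T + m) \times 1]$ system of the
form of [12.1.42] in which only the linear combinations for which there is useful
prior information are added as observations.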


12.2. Bayesian Analysis of Vector Autoregressions 


Litterman’s Prior Distribution for Estimation of an Equation 
of a VAR 


This section discusses prior information that might help improve the estimates
of a single equation of a VAR. Much of the early econometric research with dynamic
relations was concerned with estimation of distributed lag relations of the form

$$y_t = c + \omega_0 x_t + \omega_1 x_{t-1} + \cdots + \omega_p x_{t-p} + u_t. \qquad \text{[12.2.1]}$$

For this specification, $\omega_s$ has the interpretation as $\partial y_t/\partial x_{t-s}$, and some
have argued that this should be a smooth function of $s$; see Almon (1965) and Shiller (1973)
for examples. Whatever the merit of this view, it is hard to justify imposing a
smoothness condition on the sequences $\{\omega_s\}_{s=0}^{p}$ or $\{\phi_s\}_{s=1}^{p}$ in a model with
autoregressive terms such as

$$\begin{aligned}
y_t &= c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} \\
&\quad + \omega_0 x_t + \omega_1 x_{t-1} + \cdots + \omega_p x_{t-p} + u_t,
\end{aligned}$$

since here the dynamic multiplier $\partial y_t/\partial x_{t-s}$ is a complicated nonlinear function of
the $\phi$’s and $\omega$’s.

Litterman (1986) suggested an alternative representation of prior information
based on the belief that the change in the series is impossible to forecast:

$$y_t = y_{t-1} + \varepsilon_t, \qquad \text{[12.2.2]}$$

where $\varepsilon_t$ is uncorrelated with lagged values of any variable. Economic theory
predicts such behavior for many time series. For example, suppose that $y_t$ is the
log of the real price of some asset at time $t$, that is, the price adjusted for inflation.
Then $y_t - y_{t-1}$ is approximately the real rate of return from buying the asset
at $t - 1$ and selling it at $t$. In an extension of Fama’s (1965) efficient markets
argument described in Section 11.2, speculators would have bought more of the
asset at time $t - 1$ if they had expected unusually high returns, driving $y_{t-1}$ up in
relation to the anticipated value of $y_t$. The time path for $\{y_t\}$ that results from such
speculation would exhibit price changes that are unforecastable. Thus, we might
expect the real prices of items such as stocks, real estate, or precious metals to
satisfy [12.2.2]. Hall (1978) argued that the level of spending by consumers should
also satisfy [12.2.2], while Barro (1979) and Mankiw (1987) developed related
arguments for the taxes levied and new money issued by the government. Changes
in foreign exchange rates are argued by many to be unpredictable as well; see the
evidence reviewed in Diebold and Nason (1990).
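Write the $i$th equation in a VAR as

$$\begin{aligned}
y_{it} &= c_i + \phi_{i1}^{(1)} y_{1,t-1} + \phi_{i2}^{(1)} y_{2,t-1} + \cdots + \phi_{in}^{(1)} y_{n,t-1} \\
&\quad + \phi_{i1}^{(2)} y_{1,t-2} + \phi_{i2}^{(2)} y_{2,t-2} + \cdots + \phi_{in}^{(2)} y_{n,t-2} + \cdots \\
&\quad + \phi_{i1}^{(p)} y_{1,t-p} + \phi_{i2}^{(p)} y_{2,t-p} + \cdots + \phi_{in}^{(p)} y_{n,t-p} + \varepsilon_{it},
\end{aligned} \qquad \text{[12.2.3]}$$

where $\phi_{ij}^{(s)}$ gives the coefficient relating $y_{it}$ to $y_{j,t-s}$. The restriction [12.2.2]
requires $\phi_{ii}^{(1)} = 1$ and all other $\phi_{ij}^{(s)} = 0$. These values (0 or 1) then characterize
the mean of the prior distribution for the coefficients. Litterman used a diffuse prior
distribution for the constant term $c_i$.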

Litterman took the variance-covariance matrix for the prior distribution to
be diagonal, with $\gamma$ denoting the standard deviation of the prior distribution
for $\phi_{ii}^{(1)}$:

$$\phi_{ii}^{(1)} \sim N(1, \gamma^2).$$

Although each equation $i = 1, 2, \ldots, n$ of the VAR is estimated separately,
typically the same number $\gamma$ is used for each $i$. A smaller value for $\gamma$ represents
greater confidence in the prior information and will force the parameter estimates
to be closer to the values predicted in [12.2.2]. A value of $\gamma = 0.20$ means that,
before seeing the data, the analyst had 95% confidence that $\phi_{ii}^{(1)}$ is no smaller than
0.60 and no larger than 1.40.

The coefficients relating $y_{it}$ to further lags are predicted to be zero, and
Litterman argued that the analyst should have more confidence in this prediction
the greater the lag. He therefore suggested taking $\phi_{ii}^{(2)} \sim N(0, (\gamma/2)^2)$,
$\phi_{ii}^{(3)} \sim N(0, (\gamma/3)^2)$, ..., and $\phi_{ii}^{(p)} \sim N(0, (\gamma/p)^2)$, tightening the
prior distribution with a harmonic series for the standard deviation as the lag increases.

Note that the coefficients $\phi_{ii}^{(s)}$ are scale-invariant; if each value of $y_{it}$ is
multiplied by 100, the values of $\phi_{ii}^{(s)}$ will be the same. The same is not true of $\phi_{ij}^{(s)}$ for
$i \neq j$; if series $i$ is multiplied by 100 but series $j$ is not, then $\phi_{ij}^{(s)}$ will be multiplied
by 100. Thus, in calculating the weight to be given the prior information about
$\phi_{ij}^{(s)}$, an adjustment for the units in which the data are measured is necessary.
Litterman proposed using the following standard deviation of the prior distribution
for $\phi_{ij}^{(s)}$:

$$\frac{w \cdot \gamma \cdot \hat{\tau}_i}{s \cdot \hat{\tau}_j}. \tag*{[12.2.4]}$$

Here $(\hat{\tau}_i/\hat{\tau}_j)$ is a correction for the scale of series $i$ compared with series $j$. Litterman
suggested that $\hat{\tau}_i$ could be estimated from the standard deviation of the residuals
from an OLS regression of $y_{it}$ on a constant and on $p$ of its own lagged values.
Apart from this scale correction, [12.2.4] simply multiplies $\gamma/s$ (which was the
standard deviation for the prior distribution for $\phi_{ii}^{(s)}$) by a parameter $w$. Common
experience with many time series is that the own lagged values $y_{i,t-s}$ are likely to
be of more help in forecasting $y_{it}$ than will be values of other variables $y_{j,t-s}$. Hence
we should have more confidence in the prior belief that $\phi_{ij}^{(s)} = 0$ than the prior
belief that $\phi_{ii}^{(s)} = 0$, suggesting a value for $w$ that is less than 1. Doan (1990)
recommended a value of $w = 0.5$ in concert with $\gamma = 0.20$.
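
The prior just described can be tabulated mechanically. The following Python sketch is only an illustration of [12.2.4] and of the own-lag rule $\gamma/s$; the function name `litterman_prior`, the array layout, and the example values of $\hat{\tau}$ are assumptions introduced here, not part of Litterman's or Doan's software.

```python
import numpy as np

def litterman_prior(i, p, tau_hat, gamma=0.20, w=0.5):
    """Prior means and standard deviations for equation i of an n-variable VAR(p).

    Returns two (p x n) arrays whose [s-1, j] entry refers to the
    coefficient on y_{j,t-s} in equation i (hypothetical layout).
    """
    tau_hat = np.asarray(tau_hat, dtype=float)
    n = tau_hat.size
    mean = np.zeros((p, n))
    mean[0, i] = 1.0                      # random-walk prior: own first lag has mean 1
    sd = np.empty((p, n))
    for s in range(1, p + 1):
        for j in range(n):
            if j == i:
                sd[s - 1, j] = gamma / s  # own lags: gamma / s
            else:                         # other variables: [12.2.4]
                sd[s - 1, j] = w * gamma * tau_hat[i] / (s * tau_hat[j])
    return mean, sd

# Two series whose residual standard deviations differ by a factor of 2 (made-up values).
prior_mean, prior_sd = litterman_prior(i=0, p=3, tau_hat=[1.0, 2.0])
```

With these illustrative inputs the own-lag standard deviations are 0.20, 0.10, and 0.20/3, while the cross-variable standard deviations are half as large again and are rescaled by $\hat{\tau}_i/\hat{\tau}_j$.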

Several cautions in employing this prior distribution should be noted. First,
for some series the natural prior expectation might be that the series is white noise
rather than an autoregression with unit coefficient. For example, if $y_{it}$ is a series
such as the change in stock prices, then the mean of $\phi_{ii}^{(1)}$ should be 0 rather than 1.
Second, many economic series display seasonal behavior. In such cases, $\phi_{ii}^{(s)}$ is likely
to be nonzero for $s = 12$ and 24 with monthly data, for example. Litterman's prior
distribution is not well suited for seasonal data, and some researchers suggest using
seasonally adjusted data or including seasonal dummy variables in the regression
before employing this prior distribution. Finally, the prior distribution is not well
suited for systems that exhibit cointegration, a topic discussed in detail in Chapter
19.


Full-Information Bayesian Estimation of a VAR 


Litterman’s approach to Bayesian estimation of a VAR considered a single 
equation in isolation. It is possible to analyze all of the equations in a VAR together 
in a Bayesian framework, though the analytical results are somewhat more com- 
plicated than for the single-equation case; see Zellner (1971, Chapter 8) and Roth- 
enberg (1973, pp. 139-44) for discussion. 


12.3. Numerical Bayesian Methods 


In the previous examples, the class of densities used to represent the prior infor- 
mation was carefully chosen in order to obtain a simple analytical characterization 
for the posterior distribution. For many specifications of interest, however, it may 
be impossible to find such a class, or the density that best reflects the analyst's 
prior information may not be possible to represent with this class. It is therefore 
useful to have computer-based methods to calculate or approximate posterior mo- 
ments for a quite general class of problems. 


Approximating the Posterior Mean by the Posterior Mode 


One option is to use the mode rather than the mean of the posterior distribution,
that is, to take the Bayesian estimate $\hat{\theta}$ to be the value that maximizes
$f(\theta|y)$. For symmetric unimodal distributions, the mean and the mode will be the
same, as turned out to be the case for the coefficient vector $\beta$ in Proposition 12.2.
Where the mean and mode differ, with a quadratic loss function the mode is a
suboptimal estimator, though typically the posterior mode will approach the posterior
mean as the sample size grows (see DeGroot, 1970, p. 236).

Recall from [12.1.2] and [12.1.3] that the posterior density is given by

$$f(\theta|y) = \frac{f(y|\theta) \cdot f(\theta)}{f(y)}, \tag*{[12.3.1]}$$

and therefore the log of the posterior density is

$$\log f(\theta|y) = \log f(y|\theta) + \log f(\theta) - \log f(y). \tag*{[12.3.2]}$$


Note that if the goal is to maximize [12.3.2] with respect to $\theta$, it is not necessary
to calculate $f(y)$, since this does not depend on $\theta$. The posterior mode can thus be
found by maximizing

$$\log f(\theta, y) = \log f(y|\theta) + \log f(\theta). \tag*{[12.3.3]}$$

To evaluate [12.3.2], we need only to be able to calculate the likelihood function
$f(y|\theta)$ and the density that describes the prior information, $f(\theta)$. Expression [12.3.2]
can be maximized by numerical methods, and often the same particular algorithms
that maximize the log likelihood will also maximize [12.3.2]. For example, the log
likelihood for a Gaussian regression model such as [12.1.21] can be maximized by
a GLS regression, just as the posterior mode [12.1.39] can be calculated with a
GLS regression.
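
To make the last remark concrete, the sketch below treats the Gaussian regression with prior $\beta \sim N(m, \sigma^2 M)$ used in Proposition 12.2. Under that assumption, maximizing [12.3.3] over $\beta$ amounts to a least squares regression on data augmented with pseudo-observations that encode the prior. The simulated data, the prior values, and all variable names are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, k = 40, 3
X = rng.standard_normal((T, k))                      # simulated regressors
y = X @ np.array([1.0, -0.5, 0.25]) + rng.standard_normal(T)

m = np.zeros(k)                                      # prior mean (assumed)
M = 0.5 * np.eye(k)                                  # prior variance is sigma^2 * M (assumed)
M_inv = np.linalg.inv(M)

# Closed form: the posterior mode (and mean) (M^{-1} + X'X)^{-1} (M^{-1} m + X'y).
beta_closed = np.linalg.solve(M_inv + X.T @ X, M_inv @ m + X.T @ y)

# Same answer from a regression on augmented data: choose L with L'L = M^{-1}
# and stack the k pseudo-observations (L m, L) under the sample (y, X).
L = np.linalg.cholesky(M_inv).T
X_aug = np.vstack([X, L])
y_aug = np.concatenate([y, L @ m])
beta_mode, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

# Gradient of log f(y|beta) + log f(beta), up to the factor 1/sigma^2.
gradient = X.T @ (y - X @ beta_mode) - M_inv @ (beta_mode - m)
```

The gradient is numerically zero at `beta_mode`, which is what one expects at an interior maximum of [12.3.3]; the value of $\sigma^2$ does not affect the location of the mode in this setup.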


Tierney and Kadane’s Approximation 
for Posterior Moments 


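Alternatively, Tierney and Kadane (1986) noted that the curvature of the
likelihood surface can be used to estimate the distance of the posterior mode from
the posterior mean. Suppose that the objective is to calculate

$$E[g(\theta)|y] = \int_{-\infty}^{\infty} g(\theta) \cdot f(\theta|y) \, d\theta, \tag*{[12.3.4]}$$

where $\theta$ is an $(a \times 1)$ vector of parameters and $g: \mathbb{R}^a \to \mathbb{R}^1$ is a function of interest.
For example, if $g(\theta) = \theta_1$, then [12.3.4] is the posterior mean of the first parameter,
while $g(\theta) = \theta_1^2$ gives the second moment. Expression [12.3.1] can be used to write
[12.3.4] as

$$E[g(\theta)|y] = \frac{\int_{-\infty}^{\infty} g(\theta) \cdot f(y|\theta) \cdot f(\theta) \, d\theta}{f(y)} = \frac{\int_{-\infty}^{\infty} g(\theta) \cdot f(y|\theta) \cdot f(\theta) \, d\theta}{\int_{-\infty}^{\infty} f(y|\theta) \cdot f(\theta) \, d\theta}. \tag*{[12.3.5]}$$

Define

$$h(\theta) \equiv (1/T) \log\{g(\theta) \cdot f(y|\theta) \cdot f(\theta)\} \tag*{[12.3.6]}$$

and

$$k(\theta) \equiv (1/T) \log\{f(y|\theta) \cdot f(\theta)\}. \tag*{[12.3.7]}$$

This allows [12.3.5] to be written

$$E[g(\theta)|y] = \frac{\int_{-\infty}^{\infty} \exp[T \cdot h(\theta)] \, d\theta}{\int_{-\infty}^{\infty} \exp[T \cdot k(\theta)] \, d\theta}. \tag*{[12.3.8]}$$

Let $\theta^*$ be the value that maximizes [12.3.6], and consider a second-order
Taylor series approximation to $h(\theta)$ around $\theta^*$:

$$h(\theta) \cong h(\theta^*) + \left.\frac{\partial h(\theta)}{\partial \theta'}\right|_{\theta=\theta^*} \cdot (\theta - \theta^*) + \frac{1}{2} (\theta - \theta^*)' \left\{ \left.\frac{\partial^2 h(\theta)}{\partial \theta \, \partial \theta'}\right|_{\theta=\theta^*} \right\} (\theta - \theta^*). \tag*{[12.3.9]}$$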


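Assuming that $\theta^*$ is an interior optimum of $h(\cdot)$, the first derivative
$[\partial h(\theta)/\partial \theta']|_{\theta=\theta^*}$ is 0. Then [12.3.9] could be expressed as

$$h(\theta) \cong h(\theta^*) - (1/2)(\theta - \theta^*)'(\Sigma^*)^{-1}(\theta - \theta^*), \tag*{[12.3.10]}$$

where

$$\Sigma^* \equiv -\left[ \left.\frac{\partial^2 h(\theta)}{\partial \theta \, \partial \theta'}\right|_{\theta=\theta^*} \right]^{-1}. \tag*{[12.3.11]}$$

When [12.3.10] is substituted into the numerator of [12.3.8], the result is

$$
\begin{aligned}
\int_{-\infty}^{\infty} \exp[T \cdot h(\theta)] \, d\theta
&\cong \int_{-\infty}^{\infty} \exp\{T \cdot h(\theta^*) - (T/2)(\theta - \theta^*)'(\Sigma^*)^{-1}(\theta - \theta^*)\} \, d\theta \\
&= \exp[T \cdot h(\theta^*)] \int_{-\infty}^{\infty} \exp\{(-1/2)(\theta - \theta^*)'(\Sigma^*/T)^{-1}(\theta - \theta^*)\} \, d\theta \\
&= \exp[T \cdot h(\theta^*)] \, (2\pi)^{a/2} |\Sigma^*/T|^{1/2} \\
&\qquad \times \int_{-\infty}^{\infty} \frac{1}{(2\pi)^{a/2} |\Sigma^*/T|^{1/2}} \exp\{(-1/2)(\theta - \theta^*)'(\Sigma^*/T)^{-1}(\theta - \theta^*)\} \, d\theta \\
&= \exp[T \cdot h(\theta^*)] \, (2\pi)^{a/2} |\Sigma^*/T|^{1/2}.
\end{aligned}
\tag*{[12.3.12]}
$$

The last equality follows because the expression being integrated is a $N(\theta^*, \Sigma^*/T)$
density and therefore integrates to unity.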

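Similarly, the function $k(\theta)$ can be approximated with an expansion around
the posterior mode $\hat{\theta}$,

$$k(\theta) \cong k(\hat{\theta}) - \frac{1}{2} (\theta - \hat{\theta})' \hat{\Sigma}^{-1} (\theta - \hat{\theta}),$$

where $\hat{\theta}$ maximizes [12.3.7] and

$$\hat{\Sigma} \equiv -\left[ \left.\frac{\partial^2 k(\theta)}{\partial \theta \, \partial \theta'}\right|_{\theta=\hat{\theta}} \right]^{-1}. \tag*{[12.3.13]}$$

The denominator in [12.3.8] is then approximated by

$$\int_{-\infty}^{\infty} \exp[T \cdot k(\theta)] \, d\theta \cong \exp[T \cdot k(\hat{\theta})] \, (2\pi)^{a/2} |\hat{\Sigma}/T|^{1/2}. \tag*{[12.3.14]}$$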


Tierney and Kadane's approximation is obtained by substituting [12.3.12] and
[12.3.14] into [12.3.8]:

$$
\begin{aligned}
E[g(\theta)|y] &\cong \frac{\exp[T \cdot h(\theta^*)] \, (2\pi)^{a/2} |\Sigma^*/T|^{1/2}}{\exp[T \cdot k(\hat{\theta})] \, (2\pi)^{a/2} |\hat{\Sigma}/T|^{1/2}} \\
&= \frac{|\Sigma^*|^{1/2}}{|\hat{\Sigma}|^{1/2}} \exp\{T \cdot [h(\theta^*) - k(\hat{\theta})]\}.
\end{aligned}
\tag*{[12.3.15]}
$$

To calculate this approximation to the posterior mean of $g(\theta)$, we first find the
value $\theta^*$ that maximizes $(1/T) \cdot \{\log g(\theta) + \log f(y|\theta) + \log f(\theta)\}$. Then $h(\theta^*)$ in
[12.3.15] is the maximum value attained for this function and $\Sigma^*$ is the negative
of the inverse of the matrix of second derivatives of this function. Next we find
the value $\hat{\theta}$ that maximizes $(1/T) \cdot \{\log f(y|\theta) + \log f(\theta)\}$, with $k(\hat{\theta})$ the maximum
value attained and $\hat{\Sigma}$ the negative of the inverse of the matrix of second derivatives.
The required maximization and second derivatives could be calculated analytically
or numerically. Substituting the resulting values into [12.3.15] gives the Bayesian
posterior estimate of $g(\theta)$.
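
The recipe can be tried on a scalar problem whose answer is known. In the sketch below the product $f(y|\theta) \cdot f(\theta)$ is assumed to be proportional to $\theta^{a-1} e^{-b\theta}$ (a gamma kernel, chosen purely for illustration), so that the exact posterior mean of $g(\theta) = \theta$ is $a/b$. The helper names, the finite-difference Newton search, and the values of $a$, $b$, and $T$ are all assumptions; note also that taking $\log g(\theta)$ requires $g$ to be positive.

```python
import numpy as np

def newton_max(f, x0, eps=1e-4, iters=100):
    """Maximize a scalar function by Newton steps on finite-difference derivatives."""
    x = x0
    for _ in range(iters):
        d1 = (f(x + eps) - f(x - eps)) / (2 * eps)
        d2 = (f(x + eps) - 2 * f(x) + f(x - eps)) / eps**2
        x = x - d1 / d2
    d2 = (f(x + eps) - 2 * f(x) + f(x - eps)) / eps**2
    return x, d2                      # maximizer and second derivative there

def tierney_kadane(log_g, log_kernel, T, x0):
    """Scalar version of [12.3.15]; log_kernel is log{f(y|theta) f(theta)} up to a constant."""
    h = lambda th: (log_g(th) + log_kernel(th)) / T      # [12.3.6]
    k = lambda th: log_kernel(th) / T                    # [12.3.7]
    theta_star, h2 = newton_max(h, x0)
    theta_hat, k2 = newton_max(k, x0)
    sigma_star, sigma_hat = -1.0 / h2, -1.0 / k2         # [12.3.11], [12.3.13]
    return np.sqrt(sigma_star / sigma_hat) * np.exp(T * (h(theta_star) - k(theta_hat)))

a, b, T = 10.0, 2.0, 25               # illustrative values only
approx = tierney_kadane(np.log, lambda th: (a - 1) * np.log(th) - b * th, T, x0=1.0)
exact = a / b
```

Any constant dropped from the log kernel cancels in $h(\theta^*) - k(\hat{\theta})$, which is why the kernel need only be known up to proportionality. In this example the approximation differs from $a/b$ by well under one percent.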


Monte Carlo Estimation of Posterior Moments 


Posterior moments can alternatively be estimated using the Monte Carlo
approach suggested by Hammersley and Handscomb (1964, Section 5.4) and Kloek
and van Dijk (1978). Again, the objective is taken to be calculation of the posterior
mean of $g(\theta)$. Let $I(\theta)$ be some density function defined on $\theta$ with $I(\theta) > 0$ for all
$\theta$. Then [12.3.5] can be written

$$
E[g(\theta)|y] = \frac{\int_{-\infty}^{\infty} g(\theta) \cdot f(y|\theta) \cdot f(\theta) \, d\theta}{\int_{-\infty}^{\infty} f(y|\theta) \cdot f(\theta) \, d\theta}
= \frac{\int_{-\infty}^{\infty} \{g(\theta) \cdot f(y|\theta) \cdot f(\theta)/I(\theta)\} I(\theta) \, d\theta}{\int_{-\infty}^{\infty} \{f(y|\theta) \cdot f(\theta)/I(\theta)\} I(\theta) \, d\theta}.
\tag*{[12.3.16]}
$$

The numerator in [12.3.16] can be interpreted as the expectation of the random
variable $\{g(\theta) \cdot f(y|\theta) \cdot f(\theta)/I(\theta)\}$, where this expectation is taken with respect to
the distribution implied by the density $I(\theta)$. If $I(\theta)$ is a known density such as
multivariate Gaussian, it may be simple to generate $N$ separate Monte Carlo draws
from this distribution, denoted $\{\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(N)}\}$. We can then calculate the
average realized value of the random variable across these Monte Carlo draws:

$$\sum_{i=1}^{N} (1/N) \cdot \{g(\theta^{(i)}) \cdot f(y|\theta^{(i)}) \cdot f(\theta^{(i)})/I(\theta^{(i)})\}. \tag*{[12.3.17]}$$

From the law of large numbers, as $N \to \infty$, this will yield a consistent estimate of

$$E_{I(\theta)}\{g(\theta) \cdot f(y|\theta) \cdot f(\theta)/I(\theta)\} = \int_{-\infty}^{\infty} \{g(\theta) \cdot f(y|\theta) \cdot f(\theta)/I(\theta)\} I(\theta) \, d\theta, \tag*{[12.3.18]}$$

provided that the integral in [12.3.18] exists. The denominator of [12.3.16] is similarly
estimated from

$$\sum_{i=1}^{N} (1/N) \cdot \{f(y|\theta^{(i)}) \cdot f(\theta^{(i)})/I(\theta^{(i)})\}.$$

The integral in [12.3.18] need not exist if the importance density $I(\theta)$ goes
to zero in the tails faster than the sample likelihood $f(y|\theta)$. Even if [12.3.18] does
exist, the Monte Carlo average [12.3.17] may give a poor estimate of [12.3.18] for
moderate $N$ if $I(\theta)$ is poorly chosen. Geweke (1989) provided advice on specifying
$I(\theta)$. If the set of allowable values for $\theta$ forms a compact set, then letting $I(\theta)$ be
the density for the asymptotic distribution of the maximum likelihood estimator is
usually a good approach.

A nice illustration of the versatility of Bayesian Monte Carlo methods for
analyzing dynamic models is provided by Geweke (1988a). This approach was
extended to multivariate dynamic systems in Geweke (1988b).
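
The ratio of Monte Carlo averages in [12.3.16] and [12.3.17] can be checked on a case where the posterior mean is known. The sketch below assumes i.i.d. $N(\mu, 1)$ data with prior $\mu \sim N(m, 1/\nu)$, for which the posterior mean is $(\nu m + T\bar{y})/(\nu + T)$; the importance density is taken to be a Gaussian centered at $\bar{y}$ with variance $4/T$, which has thicker tails than the posterior. The simulated data, the seed, and every name in the code are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, nu, m = 20, 4.0, 0.0
y = rng.normal(1.0, 1.0, size=T)          # simulated sample with sigma^2 = 1
ybar = y.mean()

N = 50_000
scale = np.sqrt(4.0 / T)                  # std. dev. of the importance density I(theta)
draws = ybar + scale * rng.standard_normal(N)

# Log of f(y|theta) f(theta) / I(theta) at each draw. Normalizing constants are
# omitted because they cancel between numerator and denominator of [12.3.16].
log_lik = -0.5 * ((y[:, None] - draws[None, :]) ** 2).sum(axis=0)
log_prior = -0.5 * nu * (draws - m) ** 2
log_importance = -0.5 * ((draws - ybar) / scale) ** 2
log_w = log_lik + log_prior - log_importance
weights = np.exp(log_w - log_w.max())     # rescaled for numerical stability

g = draws                                 # g(theta) = theta gives the posterior mean
mc_estimate = (weights * g).sum() / weights.sum()
exact_mean = (nu * m + T * ybar) / (nu + T)
```

Working with log weights and subtracting the maximum before exponentiating is a numerical convenience rather than part of the method; it leaves the ratio unchanged.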




APPENDIX 12.A. Proofs of Chapter 12 Propositions 


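■ Proof of Proposition 12.1. Note that the product of [12.1.5] and [12.1.6] can be written

$$f(y, \mu; \sigma^2) = \frac{1}{(2\pi)^{(T+1)/2}} |\Sigma|^{-1/2} \exp\left\{ -\frac{1}{2} \alpha' \Sigma^{-1} \alpha \right\}, \tag*{[12.A.1]}$$

where

$$\underset{[(T+1) \times 1]}{\alpha} \equiv \begin{bmatrix} \mu - m \\ y - \mu \cdot \mathbf{1} \end{bmatrix} \tag*{[12.A.2]}$$

$$\underset{[(T+1) \times (T+1)]}{\Sigma} \equiv \begin{bmatrix} \sigma^2/\nu & 0' \\ 0 & \sigma^2 I_T \end{bmatrix}. \tag*{[12.A.3]}$$

The goal is to rearrange $\alpha$ so that $\mu$ appears only in the first element. Define

$$\underset{[(T+1) \times (T+1)]}{A} \equiv \begin{bmatrix} \nu/(\nu + T) & -\mathbf{1}'/(\nu + T) \\ \mathbf{1} & I_T \end{bmatrix}. \tag*{[12.A.4]}$$

Since $\mathbf{1}'\mathbf{1} = T$ and $\mathbf{1}'y = T\bar{y}$, we have

$$
A\alpha = \begin{bmatrix} [\nu/(\nu + T)](\mu - m) - \mathbf{1}'y/(\nu + T) + [T/(\nu + T)]\mu \\ y - m \cdot \mathbf{1} \end{bmatrix}
= \begin{bmatrix} \mu - m^* \\ y - m \cdot \mathbf{1} \end{bmatrix}
\equiv \alpha^*
\tag*{[12.A.5]}
$$

and

$$
A \Sigma A' = \begin{bmatrix} \sigma^2/(\nu + T) & -\sigma^2 \mathbf{1}'/(\nu + T) \\ \sigma^2 \mathbf{1}/\nu & \sigma^2 I_T \end{bmatrix}
\begin{bmatrix} \nu/(\nu + T) & \mathbf{1}' \\ -\mathbf{1}/(\nu + T) & I_T \end{bmatrix}
= \begin{bmatrix} \sigma^2/(\nu + T) & 0' \\ 0 & \sigma^2 (I_T + \mathbf{1}\mathbf{1}'/\nu) \end{bmatrix}
\equiv \Sigma^*.
\tag*{[12.A.6]}
$$

Thus,

$$\alpha' \Sigma^{-1} \alpha = \alpha' A' (A')^{-1} \Sigma^{-1} A^{-1} A \alpha = (A\alpha)'(A \Sigma A')^{-1}(A\alpha) = \alpha^{*\prime} (\Sigma^*)^{-1} \alpha^*. \tag*{[12.A.7]}$$

Moreover, observe that $A$ can be expressed as

$$A = \begin{bmatrix} 1 & -\mathbf{1}'/(\nu + T) \\ 0 & I_T \end{bmatrix} \begin{bmatrix} 1 & 0' \\ \mathbf{1} & I_T \end{bmatrix}.$$

Each of these triangular matrices has 1s along the principal diagonal and so has unit
determinant, implying that $|A| = 1$. Hence,

$$|\Sigma^*| = |A| \cdot |\Sigma| \cdot |A'| = |\Sigma|. \tag*{[12.A.8]}$$

Substituting [12.A.5] through [12.A.8] into [12.A.1] gives

$$
\begin{aligned}
f(y, \mu; \sigma^2)
&= \frac{1}{(2\pi)^{(T+1)/2}} |\Sigma^*|^{-1/2} \exp\left\{ -\frac{1}{2} \alpha^{*\prime} (\Sigma^*)^{-1} \alpha^* \right\} \\
&= \frac{1}{(2\pi)^{(T+1)/2}} \begin{vmatrix} \sigma^2/(\nu + T) & 0' \\ 0 & \sigma^2 (I_T + \mathbf{1}\mathbf{1}'/\nu) \end{vmatrix}^{-1/2} \\
&\qquad \times \exp\left\{ -\frac{1}{2} \begin{bmatrix} \mu - m^* \\ y - m \cdot \mathbf{1} \end{bmatrix}' \begin{bmatrix} \sigma^2/(\nu + T) & 0' \\ 0 & \sigma^2 (I_T + \mathbf{1}\mathbf{1}'/\nu) \end{bmatrix}^{-1} \begin{bmatrix} \mu - m^* \\ y - m \cdot \mathbf{1} \end{bmatrix} \right\} \\
&= \frac{1}{(2\pi)^{(T+1)/2}} \left[ \frac{\sigma^2}{\nu + T} \right]^{-1/2} \left| \sigma^2 (I_T + \mathbf{1}\mathbf{1}'/\nu) \right|^{-1/2} \\
&\qquad \times \exp\left\{ -\frac{(\mu - m^*)^2}{2\sigma^2/(\nu + T)} - \frac{(y - m \cdot \mathbf{1})'(I_T + \mathbf{1}\mathbf{1}'/\nu)^{-1}(y - m \cdot \mathbf{1})}{2\sigma^2} \right\},
\end{aligned}
\tag*{[12.A.9]}
$$

from which the factorization in Proposition 12.1 follows immediately. ■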


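■ Proof of Proposition 12.2. The product of [12.1.10] and [12.1.11] can be written as

$$f(y, \beta | X; \sigma^2) = \frac{1}{(2\pi)^{(T+k)/2}} |\Sigma|^{-1/2} \exp\left\{ -\frac{1}{2} \alpha' \Sigma^{-1} \alpha \right\}$$

with

$$\underset{[(T+k) \times 1]}{\alpha} \equiv \begin{bmatrix} \beta - m \\ y - X\beta \end{bmatrix}, \qquad
\underset{[(T+k) \times (T+k)]}{\Sigma} \equiv \begin{bmatrix} \sigma^2 M & 0 \\ 0 & \sigma^2 I_T \end{bmatrix}.$$

As in the proof of Proposition 12.1, define

$$
\underset{[(T+k) \times (T+k)]}{A} \equiv \begin{bmatrix} I_k & -(M^{-1} + X'X)^{-1} X' \\ 0 & I_T \end{bmatrix} \begin{bmatrix} I_k & 0 \\ X & I_T \end{bmatrix}
= \begin{bmatrix} (M^{-1} + X'X)^{-1} M^{-1} & -(M^{-1} + X'X)^{-1} X' \\ X & I_T \end{bmatrix}.
$$

Thus, $A$ has unit determinant and

$$A\alpha = \begin{bmatrix} \beta - m^* \\ y - Xm \end{bmatrix}$$

with

$$A \Sigma A' = \begin{bmatrix} \sigma^2 (M^{-1} + X'X)^{-1} & 0 \\ 0 & \sigma^2 (I_T + XMX') \end{bmatrix}.$$

Thus, as in equation [12.A.9],

$$
\begin{aligned}
f(y, \beta | X; \sigma^2) &= \frac{1}{(2\pi)^{(T+k)/2}} \begin{vmatrix} \sigma^2 (M^{-1} + X'X)^{-1} & 0 \\ 0 & \sigma^2 (I_T + XMX') \end{vmatrix}^{-1/2} \\
&\qquad \times \exp\left\{ -\frac{1}{2} \begin{bmatrix} \beta - m^* \\ y - Xm \end{bmatrix}' \begin{bmatrix} \sigma^2 (M^{-1} + X'X)^{-1} & 0 \\ 0 & \sigma^2 (I_T + XMX') \end{bmatrix}^{-1} \begin{bmatrix} \beta - m^* \\ y - Xm \end{bmatrix} \right\}. \quad ■
\end{aligned}
$$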
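■ Proof of Proposition 12.3(a). We have that

$$f(y, \beta, \sigma^{-2} | X) = f(y | \beta, \sigma^{-2}, X) \cdot f(\beta | \sigma^{-2}, X) \cdot f(\sigma^{-2} | X). \tag*{[12.A.10]}$$

The first two terms on the right side are identical to [12.1.10] and [12.1.11]. Thus, Proposition
12.2 can be used to write [12.A.10] as

$$
\begin{aligned}
f(y, \beta, \sigma^{-2} | X)
&= \left\{ \frac{1}{(2\pi\sigma^2)^{k/2}} |M^*|^{-1/2} \exp\left[ -\frac{1}{2\sigma^2} (\beta - m^*)'(M^*)^{-1}(\beta - m^*) \right] \right\} \\
&\quad \times \left\{ \frac{1}{(2\pi\sigma^2)^{T/2}} |I_T + XMX'|^{-1/2} \exp\left[ -\frac{1}{2\sigma^2} (y - Xm)'(I_T + XMX')^{-1}(y - Xm) \right] \right\} \\
&\quad \times \left\{ \frac{(\lambda/2)^{N/2} \sigma^{-2[(N/2)-1]} \exp[-\lambda \sigma^{-2}/2]}{\Gamma(N/2)} \right\}.
\end{aligned}
\tag*{[12.A.11]}
$$

Define

$$\lambda^* \equiv \lambda + (y - Xm)'(I_T + XMX')^{-1}(y - Xm); \tag*{[12.A.12]}$$

we will show later that this is the same as the value $\lambda^*$ described in the proposition. For
$N^* \equiv N + T$, the density [12.A.11] can be written as

$$
\begin{aligned}
f(y, \beta, \sigma^{-2} | X)
&= \left\{ \frac{1}{(2\pi\sigma^2)^{k/2}} |M^*|^{-1/2} \exp\left[ -\frac{1}{2\sigma^2} (\beta - m^*)'(M^*)^{-1}(\beta - m^*) \right] \right\} \\
&\quad \times \left\{ \frac{\sigma^{-2[(N^*/2)-1]} (\lambda/2)^{N/2}}{(2\pi)^{T/2} \, \Gamma(N/2)} |I_T + XMX'|^{-1/2} \exp\left[ -\frac{\lambda^* \sigma^{-2}}{2} \right] \right\} \\
&= \left\{ \frac{1}{(2\pi\sigma^2)^{k/2}} |M^*|^{-1/2} \exp\left[ -\frac{1}{2\sigma^2} (\beta - m^*)'(M^*)^{-1}(\beta - m^*) \right] \right\} \\
&\quad \times \left\{ \frac{\sigma^{-2[(N^*/2)-1]} (\lambda^*/2)^{N^*/2}}{\Gamma(N^*/2)} \exp\left[ -\frac{\lambda^* \sigma^{-2}}{2} \right] \right\} \\
&\quad \times \left\{ \frac{\Gamma(N^*/2) (\lambda/2)^{N/2}}{(2\pi)^{T/2} \, \Gamma(N/2) (\lambda^*/2)^{N^*/2}} |I_T + XMX'|^{-1/2} \right\}.
\end{aligned}
\tag*{[12.A.13]}
$$

The second term does not involve $\beta$, and the third term does not involve $\beta$ or $\sigma^{-2}$. Thus,
[12.A.13] provides the factorization

$$f(y, \beta, \sigma^{-2} | X) = \{f(\beta | \sigma^{-2}, y, X)\} \cdot \{f(\sigma^{-2} | y, X)\} \cdot \{f(y | X)\},$$

where $f(\beta | \sigma^{-2}, y, X)$ is a $N(m^*, \sigma^2 M^*)$ density, $f(\sigma^{-2} | y, X)$ is a $\Gamma(N^*, \lambda^*)$ density, and
$f(y | X)$ can be written as

$$
\begin{aligned}
f(y | X) &= \left\{ \frac{\Gamma(N^*/2) (\lambda/2)^{N/2}}{(2\pi)^{T/2} \, \Gamma(N/2) (\lambda^*/2)^{N^*/2}} |I_T + XMX'|^{-1/2} \right\} \\
&= \left\{ \frac{\Gamma[(N + T)/2] \, \lambda^{N/2} \, |I_T + XMX'|^{-1/2}}{\pi^{T/2} \, \Gamma(N/2) \, \{\lambda + (y - Xm)'(I_T + XMX')^{-1}(y - Xm)\}^{(N+T)/2}} \right\} \\
&= c \cdot \{1 + (1/N)(y - Xm)'[(\lambda/N)(I_T + XMX')]^{-1}(y - Xm)\}^{-(N+T)/2},
\end{aligned}
$$

where

$$c \equiv \frac{\Gamma[(N + T)/2] \, (1/N)^{T/2} \, |(\lambda/N)(I_T + XMX')|^{-1/2}}{\pi^{T/2} \, \Gamma(N/2)}.$$

Thus, $f(y | X)$ is a $T$-dimensional Student's $t$ density with $N$ degrees of freedom, mean
$Xm$, and scale matrix $(\lambda/N)(I_T + XMX')$. Hence, the distributions of $(\beta | \sigma^{-2}, y, X)$ and
$(\sigma^{-2} | y, X)$ are as claimed in Proposition 12.3, provided that the magnitude $\lambda^*$ defined in
[12.A.12] is the same as the expression in [12.1.28]. To verify that this is indeed the case,
notice that

$$(I_T + XMX')^{-1} = I_T - X(X'X + M^{-1})^{-1}X', \tag*{[12.A.14]}$$

as can be verified by premultiplying [12.A.14] by $(I_T + XMX')$:

$$
\begin{aligned}
(I_T + XMX')&[I_T - X(X'X + M^{-1})^{-1}X'] \\
&= I_T + XMX' - X(X'X + M^{-1})^{-1}X' - XM(X'X)(X'X + M^{-1})^{-1}X' \\
&= I_T + X\{M(X'X + M^{-1}) - I_k - M(X'X)\}(X'X + M^{-1})^{-1}X' \\
&= I_T.
\end{aligned}
$$

Using [12.A.14], we see that

$$
\begin{aligned}
(y - Xm)'&(I_T + XMX')^{-1}(y - Xm) \\
&= (y - Xm)'[I_T - X(X'X + M^{-1})^{-1}X'](y - Xm) \\
&= (y - Xb + Xb - Xm)'[I_T - X(X'X + M^{-1})^{-1}X'](y - Xb + Xb - Xm) \\
&= (y - Xb)'(y - Xb) + (b - m)'X'[I_T - X(X'X + M^{-1})^{-1}X']X(b - m),
\end{aligned}
\tag*{[12.A.15]}
$$

where cross-product terms have disappeared because of the OLS orthogonality condition
$(y - Xb)'X = 0'$. Furthermore,

$$
\begin{aligned}
X'[I_T - X(X'X + M^{-1})^{-1}X']X
&= [I_k - (X'X)(X'X + M^{-1})^{-1}]X'X \\
&= [(X'X + M^{-1})(X'X + M^{-1})^{-1} - (X'X)(X'X + M^{-1})^{-1}]X'X \\
&= M^{-1}(X'X + M^{-1})^{-1}X'X.
\end{aligned}
$$

This allows [12.A.15] to be written as

$$
(y - Xm)'(I_T + XMX')^{-1}(y - Xm)
= (y - Xb)'(y - Xb) + (b - m)'M^{-1}(X'X + M^{-1})^{-1}X'X(b - m),
$$

establishing the equivalence of [12.A.12] and [12.1.28].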


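Proof of (b). The joint posterior density of $\beta$ and $\sigma^{-2}$ is given by

$$
\begin{aligned}
f(\beta, \sigma^{-2} | y, X)
&= f(\beta | \sigma^{-2}, y, X) \cdot f(\sigma^{-2} | y, X) \\
&= \left\{ \frac{1}{(2\pi\sigma^2)^{k/2}} |M^*|^{-1/2} \exp\left[ -\frac{1}{2\sigma^2} (\beta - m^*)'(M^*)^{-1}(\beta - m^*) \right] \right\} \\
&\quad \times \left\{ \frac{\sigma^{-2[(N^*/2)-1]} (\lambda^*/2)^{N^*/2}}{\Gamma(N^*/2)} \exp[-\lambda^* \sigma^{-2}/2] \right\} \\
&= \left\{ \frac{\sigma^{-2[\{(k+N^*)/2\}-1]}}{\Gamma[(k + N^*)/2]} \times \left( \frac{\lambda^*}{2} \cdot [1 + (\beta - m^*)'(\lambda^* M^*)^{-1}(\beta - m^*)] \right)^{(k+N^*)/2} \right. \\
&\qquad \left. \times \exp\left[ -\frac{\lambda^*}{2} [1 + (\beta - m^*)'(\lambda^* M^*)^{-1}(\beta - m^*)] \sigma^{-2} \right] \right\} \\
&\quad \times \left\{ \frac{\Gamma[(k + N^*)/2]}{(\lambda^*)^{k/2} \pi^{k/2} \, \Gamma(N^*/2)} |M^*|^{-1/2} [1 + (\beta - m^*)'(\lambda^* M^*)^{-1}(\beta - m^*)]^{-(k+N^*)/2} \right\} \\
&= \{f(\sigma^{-2} | \beta, y, X)\} \cdot \{f(\beta | y, X)\},
\end{aligned}
$$

where $f(\sigma^{-2} | \beta, y, X)$ will be recognized as a $\Gamma((k + N^*), \lambda^*[1 + (\beta - m^*)'(\lambda^* M^*)^{-1}(\beta - m^*)])$
density, while $f(\beta | y, X)$ can be written as

$$
\begin{aligned}
f(\beta | y, X) &= \left\{ \frac{\Gamma[(k + N^*)/2]}{(N^*)^{k/2} \pi^{k/2} \, \Gamma(N^*/2)} \right\} |(\lambda^*/N^*) M^*|^{-1/2} \\
&\qquad \times [1 + (1/N^*)(\beta - m^*)'[(\lambda^*/N^*) M^*]^{-1}(\beta - m^*)]^{-(k+N^*)/2},
\end{aligned}
$$

which is a $k$-dimensional $t$ density with $N^*$ degrees of freedom, mean $m^*$, and scale matrix
$(\lambda^*/N^*) M^*$.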


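Proof of (c). Notice that conditional on $y$, $X$, and $\sigma^2$, the variable

$$Z \equiv [R(\beta - m^*)]'[\sigma^2 R(M^{-1} + X'X)^{-1}R']^{-1} \cdot [R(\beta - m^*)]$$

is distributed $\chi^2(m)$, from Proposition 8.1. The variable $Q$ in [12.1.30] is equal to $Z \cdot \sigma^2 N^*/(m\lambda^*)$,
and so conditional on $y$, $X$, and $\sigma^2$, the variable $Q$ is distributed $\Gamma(m, (m\lambda^*)/(\sigma^2 N^*))$:

$$f(q | \sigma^{-2}, y, X) = \frac{[m\lambda^*/(2\sigma^2 N^*)]^{m/2} q^{[(m/2)-1]} \exp[-m\lambda^* q/(2\sigma^2 N^*)]}{\Gamma(m/2)}. \tag*{[12.A.16]}$$

The joint posterior density of $q$ and $\sigma^{-2}$ is

$$
\begin{aligned}
f(q, \sigma^{-2} | y, X) &= f(q | \sigma^{-2}, y, X) \cdot f(\sigma^{-2} | y, X) \\
&= \left\{ \frac{[m\lambda^*/(2\sigma^2 N^*)]^{m/2} q^{[(m/2)-1]} \exp[-m\lambda^* q/(2\sigma^2 N^*)]}{\Gamma(m/2)} \right\} \\
&\quad \times \left\{ \frac{\sigma^{-2[(N^*/2)-1]} (\lambda^*/2)^{N^*/2}}{\Gamma(N^*/2)} \exp[-\lambda^* \sigma^{-2}/2] \right\} \\
&= \left\{ \frac{\{(N^* + mq) \cdot [\lambda^*/(2N^*)]\}^{(N^*+m)/2}}{\Gamma[(N^* + m)/2]} \, \sigma^{-2[\{(N^*+m)/2\}-1]} \exp[-(N^* + mq)(\lambda^*/N^*)\sigma^{-2}/2] \right\} \\
&\quad \times \left\{ \frac{m^{m/2} (N^*)^{N^*/2} \, \Gamma[(N^* + m)/2] \, q^{[(m/2)-1]}}{\Gamma(m/2) \, \Gamma(N^*/2) \, (N^* + mq)^{(N^*+m)/2}} \right\} \\
&= \{f(\sigma^{-2} | q, y, X)\} \cdot \{f(q | y, X)\},
\end{aligned}
\tag*{[12.A.17]}
$$

where $f(\sigma^{-2} | q, y, X)$ is a $\Gamma((m + N^*), (N^* + mq)(\lambda^*/N^*))$ density and $f(q | y, X)$ is an
$F(m, N^*)$ density. ■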


Chapter 12 Exercise 


12.1. Deduce Proposition 12.1 as a special case of Proposition 12.2. 


Chapter 12 References 


Almon, Shirley. 1965. “The Distributed Lag between Capital Appropriations and Expenditures.”
Econometrica 33:178-96.

Barro, Robert J. 1979. “On the Determination of the Public Debt.” Journal of Political
Economy 87:940-71.

DeGroot, Morris H. 1970. Optimal Statistical Decisions. New York: McGraw-Hill.

Diebold, Francis X., and James A. Nason. 1990. “Nonparametric Exchange Rate Prediction?”
Journal of International Economics 28:315-32.

Doan, Thomas A. 1990. RATS User’s Manual. VAR Econometrics, Suite 612, 1800 Sherman
Ave., Evanston, IL 60201.

Fama, Eugene F. 1965. “The Behavior of Stock Market Prices.” Journal of Business 38:34-105.

Geweke, John. 1988a. “The Secular and Cyclical Behavior of Real GDP in 19 OECD
Countries, 1957-1983.” Journal of Business and Economic Statistics 6:479-86.

Geweke, John. 1988b. “Antithetic Acceleration of Monte Carlo Integration in Bayesian Inference.”
Journal of Econometrics 38:73-89.

Geweke, John. 1989. “Bayesian Inference in Econometric Models Using Monte Carlo Integration.”
Econometrica 57:1317-39.

Hall, Robert E. 1978. “Stochastic Implications of the Life Cycle-Permanent Income Hypothesis:
Theory and Evidence.” Journal of Political Economy 86:971-87.

Hammersley, J. M., and D. C. Handscomb. 1964. Monte Carlo Methods, 1st ed. London:
Methuen.

Hoerl, A. E., and R. W. Kennard. 1970. “Ridge Regression: Biased Estimation for Nonorthogonal
Problems.” Technometrics 12:55-82.

Kloek, T., and H. K. van Dijk. 1978. “Bayesian Estimates of Equation System Parameters: 
An Application of Integration by Monte Carlo.” Econometrica 46:1-19. 

Leamer, Edward E. 1978. Specification Searches: Ad Hoc Inference with Nonexperimental 
Data. New York: Wiley. 




Litterman, Robert B. 1986. “Forecasting with Bayesian Vector Autoregressions—Five Years 
of Experience.” Journal of Business and Economic Statistics 4:25-38. 


Mankiw, N. Gregory. 1987. “The Optimal Collection of Seigniorage: Theory and Evidence.” 
Journal of Monetary Economics 20:327-41. 


Rothenberg, Thomas J. 1973. Efficient Estimation with A Priori Information. New Haven, 
Conn.: Yale University Press. 


Shiller, Robert J. 1973. “A Distributed Lag Estimator Derived from Smoothness Priors.”
Econometrica 41:775-88.


Theil, Henri. 1971. Principles of Econometrics. New York: Wiley. 


Tierney, Luke, and Joseph B. Kadane. 1986. “Accurate Approximations for Posterior 
Moments and Marginal Densities.” Journal of the American Statistical Association 81:82- 


Zellner, Arnold. 1971. An Introduction to Bayesian Inference in Econometrics. New York: 
Wiley. 




13 


The Kalman Filter 


This chapter introduces some very useful tools named for the contributions of 
R. E. Kalman (1960, 1963). The idea is to express a dynamic system in a particular 
form called the state-space representation. The Kalman filter is an algorithm for 
sequentially updating a linear projection for the system. Among other benefits, 
this algorithm provides a way to calculate exact finite-sample forecasts and the 
exact likelihood function for Gaussian ARMA processes, to factor matrix auto- 
covariance-generating functions or spectral densities, and to estimate vector au- 
toregressions with coefficients that change over time. 

Section 13.1 describes how a dynamic system can be written in a form that 
can be analyzed using the Kalman filter. The filter itself is derived in Section 13.2, 
and its use in forecasting is described in Section 13.3. Section 13.4 explains how 
to estimate the population parameters by maximum likelihood. Section 13.5 ana- 
lyzes the properties of the Kalman filter as the sample size grows, and explains 
how the Kalman filter is related in the limit to the Wold representation and factoring 
an autocovariance-generating function. Section 13.6 develops a smoothing algo- 
rithm, which is a way to use all the information in the sample to form the best 
inference about the unobserved state of the process at any historical date. Section 
13.7 describes standard errors for smoothed inferences and forecasts. The use of 
the Kalman filter for estimating systems with time-varying parameters is investi- 
gated in Section 13.8. 


13.1. The State-Space Representation 
of a Dynamic System 


Maintained Assumptions 


Let $y_t$ denote an $(n \times 1)$ vector of variables observed at date $t$. A rich class
of dynamic models for $y_t$ can be described in terms of a possibly unobserved
$(r \times 1)$ vector $\xi_t$ known as the state vector. The state-space representation of the
dynamics of $y$ is given by the following system of equations:

$$\xi_{t+1} = F\xi_t + v_{t+1} \tag*{[13.1.1]}$$

$$y_t = A'x_t + H'\xi_t + w_t, \tag*{[13.1.2]}$$

where $F$, $A'$, and $H'$ are matrices of parameters of dimension $(r \times r)$, $(n \times k)$,
and $(n \times r)$, respectively, and $x_t$ is a $(k \times 1)$ vector of exogenous or predetermined
variables. Equation [13.1.1] is known as the state equation, and [13.1.2] is known
as the observation equation. The $(r \times 1)$ vector $v_t$ and the $(n \times 1)$ vector $w_t$ are
vector white noise:

$$E(v_t v_\tau') = \begin{cases} Q & \text{for } t = \tau \\ 0 & \text{otherwise} \end{cases} \tag*{[13.1.3]}$$

$$E(w_t w_\tau') = \begin{cases} R & \text{for } t = \tau \\ 0 & \text{otherwise,} \end{cases} \tag*{[13.1.4]}$$

where $Q$ and $R$ are $(r \times r)$ and $(n \times n)$ matrices, respectively. The disturbances
$v_t$ and $w_t$ are assumed to be uncorrelated at all lags:

$$E(v_t w_\tau') = 0 \quad \text{for all } t \text{ and } \tau. \tag*{[13.1.5]}$$

The statement that $x_t$ is predetermined or exogenous means that $x_t$ provides no
information about $\xi_{t+s}$ or $w_{t+s}$ for $s = 0, 1, 2, \ldots$ beyond that contained in $y_{t-1}$,
$y_{t-2}, \ldots, y_1$. Thus, for example, $x_t$ could include lagged values of $y$ or variables
that are uncorrelated with $\xi_\tau$ and $w_\tau$ for all $\tau$.

The system of [13.1.1] through [13.1.5] is typically used to describe a finite
series of observations $\{y_1, y_2, \ldots, y_T\}$ for which assumptions about the initial
value of the state vector $\xi_1$ are needed. We assume that $\xi_1$ is uncorrelated with
any realizations of $v_t$ or $w_t$:

$$E(v_t \xi_1') = 0 \quad \text{for } t = 1, 2, \ldots, T \tag*{[13.1.6]}$$

$$E(w_t \xi_1') = 0 \quad \text{for } t = 1, 2, \ldots, T. \tag*{[13.1.7]}$$

The state equation [13.1.1] implies that $\xi_t$ can be written as a linear function of
$(\xi_1, v_2, v_3, \ldots, v_t)$:

$$\xi_t = v_t + Fv_{t-1} + F^2 v_{t-2} + \cdots + F^{t-2} v_2 + F^{t-1} \xi_1 \quad \text{for } t = 2, 3, \ldots, T. \tag*{[13.1.8]}$$

Thus, [13.1.6] and [13.1.3] imply that $v_t$ is uncorrelated with lagged values of $\xi$:

$$E(v_t \xi_\tau') = 0 \quad \text{for } \tau = t - 1, t - 2, \ldots, 1. \tag*{[13.1.9]}$$

Similarly,

$$E(w_t \xi_\tau') = 0 \quad \text{for } \tau = 1, 2, \ldots, T \tag*{[13.1.10]}$$

$$E(w_t y_\tau') = E[w_t(A'x_\tau + H'\xi_\tau + w_\tau)'] = 0 \quad \text{for } \tau = t - 1, t - 2, \ldots, 1 \tag*{[13.1.11]}$$

$$E(v_t y_\tau') = 0 \quad \text{for } \tau = t - 1, t - 2, \ldots, 1. \tag*{[13.1.12]}$$

The system of [13.1.1] through [13.1.7] is quite flexible, though it is straightforward
to generalize the results further to systems in which $v_t$ is correlated with
$w_t$.¹ The various parameter matrices ($F$, $Q$, $A$, $H$, or $R$) could be functions of time,
as will be discussed in Section 13.8. The presentation will be clearest, however, if
we focus on the basic form in [13.1.1] through [13.1.7].

¹See, for example, Anderson and Moore (1979, pp. 105-8).


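Examples of State-Space Representations

Consider a univariate AR(p) process,

$$y_{t+1} - \mu = \phi_1(y_t - \mu) + \phi_2(y_{t-1} - \mu) + \cdots + \phi_p(y_{t-p+1} - \mu) + \varepsilon_{t+1}, \tag*{[13.1.13]}$$

$$E(\varepsilon_t \varepsilon_\tau) = \begin{cases} \sigma^2 & \text{for } t = \tau \\ 0 & \text{otherwise.} \end{cases}$$

This could be written in state-space form as follows:

State Equation ($r = p$):

$$
\begin{bmatrix} y_{t+1} - \mu \\ y_t - \mu \\ \vdots \\ y_{t-p+2} - \mu \end{bmatrix}
= \begin{bmatrix}
\phi_1 & \phi_2 & \cdots & \phi_{p-1} & \phi_p \\
1 & 0 & \cdots & 0 & 0 \\
0 & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \cdots & \vdots & \vdots \\
0 & 0 & \cdots & 1 & 0
\end{bmatrix}
\begin{bmatrix} y_t - \mu \\ y_{t-1} - \mu \\ \vdots \\ y_{t-p+1} - \mu \end{bmatrix}
+ \begin{bmatrix} \varepsilon_{t+1} \\ 0 \\ \vdots \\ 0 \end{bmatrix}
\tag*{[13.1.14]}
$$

Observation Equation ($n = 1$):

$$
y_t = \mu + \begin{bmatrix} 1 & 0 & \cdots & 0 \end{bmatrix}
\begin{bmatrix} y_t - \mu \\ y_{t-1} - \mu \\ \vdots \\ y_{t-p+1} - \mu \end{bmatrix}.
\tag*{[13.1.15]}
$$

That is, we would specify

$$
\xi_t = \begin{bmatrix} y_t - \mu \\ y_{t-1} - \mu \\ \vdots \\ y_{t-p+1} - \mu \end{bmatrix}
\qquad
F = \begin{bmatrix}
\phi_1 & \phi_2 & \cdots & \phi_{p-1} & \phi_p \\
1 & 0 & \cdots & 0 & 0 \\
0 & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \cdots & \vdots & \vdots \\
0 & 0 & \cdots & 1 & 0
\end{bmatrix}
$$

$$
v_{t+1} = \begin{bmatrix} \varepsilon_{t+1} \\ 0 \\ \vdots \\ 0 \end{bmatrix}
\qquad
Q = \begin{bmatrix}
\sigma^2 & 0 & \cdots & 0 \\
0 & 0 & \cdots & 0 \\
\vdots & \vdots & \cdots & \vdots \\
0 & 0 & \cdots & 0
\end{bmatrix}
$$

$$y_t = y_t \qquad A' = \mu \qquad x_t = 1$$

$$H' = \begin{bmatrix} 1 & 0 & \cdots & 0 \end{bmatrix} \qquad w_t = 0 \qquad R = 0.$$

Note that the state equation here is simply the first-order vector difference equation
introduced in equation [1.2.5]; F is the same matrix appearing in equation [1.2.3].
The observation equation here is a trivial identity. Thus, we have already seen that
the state-space representation [13.1.14] and [13.1.15] is just another way of summarizing
the AR(p) process [13.1.13]. The reason for rewriting an AR(p) process
in such a form was to obtain a convenient summary of the system's dynamics, and
this is the basic reason to be interested in the state-space representation of any
system. The analysis of a vector autoregression using equation [10.1.11] employed
a similar state-space representation.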
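As another example, consider a univariate MA(1) process,

$$y_t = \mu + \varepsilon_t + \theta \varepsilon_{t-1}. \tag*{[13.1.16]}$$

This could be written in state-space form as follows:

State Equation ($r = 2$):

$$
\begin{bmatrix} \varepsilon_{t+1} \\ \varepsilon_t \end{bmatrix}
= \begin{bmatrix} 0 & 0 \\ 1 & 0 \end{bmatrix}
\begin{bmatrix} \varepsilon_t \\ \varepsilon_{t-1} \end{bmatrix}
+ \begin{bmatrix} \varepsilon_{t+1} \\ 0 \end{bmatrix}
\tag*{[13.1.17]}
$$

Observation Equation ($n = 1$):

$$
y_t = \mu + \begin{bmatrix} 1 & \theta \end{bmatrix} \begin{bmatrix} \varepsilon_t \\ \varepsilon_{t-1} \end{bmatrix};
\tag*{[13.1.18]}
$$

that is,

$$
\xi_t = \begin{bmatrix} \varepsilon_t \\ \varepsilon_{t-1} \end{bmatrix}
\qquad
F = \begin{bmatrix} 0 & 0 \\ 1 & 0 \end{bmatrix}
\qquad
v_{t+1} = \begin{bmatrix} \varepsilon_{t+1} \\ 0 \end{bmatrix}
\qquad
Q = \begin{bmatrix} \sigma^2 & 0 \\ 0 & 0 \end{bmatrix}
$$

$$y_t = y_t \qquad A' = \mu \qquad x_t = 1$$

$$H' = \begin{bmatrix} 1 & \theta \end{bmatrix} \qquad w_t = 0 \qquad R = 0.$$

There are many ways to write a given system in state-space form. For example,
the MA(1) process [13.1.16] can also be represented in this way:

State Equation ($r = 2$):

$$
\begin{bmatrix} \varepsilon_{t+1} + \theta \varepsilon_t \\ \theta \varepsilon_{t+1} \end{bmatrix}
= \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}
\begin{bmatrix} \varepsilon_t + \theta \varepsilon_{t-1} \\ \theta \varepsilon_t \end{bmatrix}
+ \begin{bmatrix} \varepsilon_{t+1} \\ \theta \varepsilon_{t+1} \end{bmatrix}
\tag*{[13.1.19]}
$$

Observation Equation ($n = 1$):

$$
y_t = \mu + \begin{bmatrix} 1 & 0 \end{bmatrix} \begin{bmatrix} \varepsilon_t + \theta \varepsilon_{t-1} \\ \theta \varepsilon_t \end{bmatrix}.
\tag*{[13.1.20]}
$$

Note that the original MA(1) representation of [13.1.16], the first state-space
representation of [13.1.17] and [13.1.18], and the second state-space representation
of [13.1.19] and [13.1.20] all characterize the same process. We will obtain the
identical forecasts of the process or value for the likelihood function from any of
the three representations and can feel free to work with whichever is most convenient.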
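
The equivalence of the three descriptions can be confirmed by feeding one set of simulated innovations through each of them. The sketch below is only a numerical check under assumed values of $\mu$ and $\theta$; the helper `run` and all variable names are introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, theta, n = 1.5, 0.6, 50
eps = rng.standard_normal(n + 1)              # eps[0] is eps_0, eps[t] is eps_t

# Direct evaluation of [13.1.16] for t = 1, ..., n.
y_direct = mu + eps[1:] + theta * eps[:-1]

def run(F, H, xi_1, shocks):
    """Iterate xi_{t+1} = F xi_t + v_{t+1} and report y_t = mu + H' xi_t."""
    xi, out = xi_1, [mu + H @ xi_1]
    for v in shocks:                          # rows hold v_2, ..., v_n
        xi = F @ xi + v
        out.append(mu + H @ xi)
    return np.array(out)

zeros = np.zeros(n - 1)

# First representation, [13.1.17]-[13.1.18]: state (eps_t, eps_{t-1})'.
y_rep1 = run(np.array([[0.0, 0.0], [1.0, 0.0]]), np.array([1.0, theta]),
             np.array([eps[1], eps[0]]), np.column_stack([eps[2:], zeros]))

# Second representation, [13.1.19]-[13.1.20]: state (eps_t + theta eps_{t-1}, theta eps_t)'.
y_rep2 = run(np.array([[0.0, 1.0], [0.0, 0.0]]), np.array([1.0, 0.0]),
             np.array([eps[1] + theta * eps[0], theta * eps[1]]),
             np.column_stack([eps[2:], theta * eps[2:]]))
```

All three series agree to machine precision, so the choice between the two state-space forms is a matter of convenience only.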
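More generally, a univariate ARMA(p, q) process can be written in state-space
form by defining $r \equiv \max\{p, q + 1\}$:

$$
\begin{aligned}
y_t - \mu = {} & \phi_1(y_{t-1} - \mu) + \phi_2(y_{t-2} - \mu) + \cdots + \phi_r(y_{t-r} - \mu) \\
& + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \cdots + \theta_{r-1} \varepsilon_{t-r+1},
\end{aligned}
\tag*{[13.1.21]}
$$

where we interpret $\phi_j = 0$ for $j > p$ and $\theta_j = 0$ for $j > q$. Consider the following
state-space representation:

State Equation ($r = \max\{p, q + 1\}$):

$$
\xi_{t+1} = \begin{bmatrix}
\phi_1 & \phi_2 & \cdots & \phi_{r-1} & \phi_r \\
1 & 0 & \cdots & 0 & 0 \\
0 & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \cdots & \vdots & \vdots \\
0 & 0 & \cdots & 1 & 0
\end{bmatrix} \xi_t
+ \begin{bmatrix} \varepsilon_{t+1} \\ 0 \\ \vdots \\ 0 \end{bmatrix}
\tag*{[13.1.22]}
$$

Observation Equation ($n = 1$):

$$y_t = \mu + \begin{bmatrix} 1 & \theta_1 & \theta_2 & \cdots & \theta_{r-1} \end{bmatrix} \xi_t. \tag*{[13.1.23]}$$

To verify that [13.1.22] and [13.1.23] describe the same process as [13.1.21], let $\xi_{jt}$
denote the $j$th element of $\xi_t$. Thus, the second row of the state equation asserts
that

$$\xi_{2,t+1} = \xi_{1t}.$$

The third row asserts that

$$\xi_{3,t+1} = \xi_{2t} = \xi_{1,t-1},$$

and in general the $j$th row implies that

$$\xi_{j,t+1} = L^{j-1} \xi_{1,t+1}.$$

Thus, the first row of the state equation implies that

$$\xi_{1,t+1} = (\phi_1 + \phi_2 L + \phi_3 L^2 + \cdots + \phi_r L^{r-1}) \xi_{1t} + \varepsilon_{t+1}$$

or

$$(1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_r L^r) \xi_{1,t+1} = \varepsilon_{t+1}. \tag*{[13.1.24]}$$

The observation equation states that

$$y_t = \mu + (1 + \theta_1 L + \theta_2 L^2 + \cdots + \theta_{r-1} L^{r-1}) \xi_{1t}. \tag*{[13.1.25]}$$

Multiplying [13.1.25] by $(1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_r L^r)$ and using [13.1.24]
gives

$$(1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_r L^r)(y_t - \mu) = (1 + \theta_1 L + \theta_2 L^2 + \cdots + \theta_{r-1} L^{r-1}) \varepsilon_t,$$

which indeed reproduces [13.1.21].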
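
The construction of $F$ and $H'$ in [13.1.22] and [13.1.23] is mechanical enough to automate. The following sketch builds them for arbitrary $(p, q)$ and compares the state-space output with a direct ARMA recursion, starting both from zero presample values so that the two paths should coincide exactly. The function name `arma_state_space` and the coefficient values are assumptions chosen for the illustration.

```python
import numpy as np

def arma_state_space(phi, theta):
    """Return (F, H) of [13.1.22]-[13.1.23], padding with zeros up to r = max(p, q + 1)."""
    p, q = len(phi), len(theta)
    r = max(p, q + 1)
    phi_r = np.zeros(r)
    phi_r[:p] = phi                       # phi_j = 0 for j > p
    theta_r = np.zeros(r - 1)
    theta_r[:q] = theta                   # theta_j = 0 for j > q
    F = np.zeros((r, r))
    F[0, :] = phi_r
    F[1:, :-1] = np.eye(r - 1)            # shifts the state down by one position
    H = np.concatenate(([1.0], theta_r))
    return F, H

phi, theta, mu, n = [0.5, -0.2], [0.4, 0.3, 0.1], 2.0, 60   # p = 2, q = 3, so r = 4
F, H = arma_state_space(phi, theta)
eps = np.random.default_rng(0).standard_normal(n)

# State-space simulation starting from a zero state vector.
xi, y_ss = np.zeros(F.shape[0]), np.empty(n)
for t in range(n):
    xi = F @ xi
    xi[0] += eps[t]                       # only the first element receives the innovation
    y_ss[t] = mu + H @ xi

# Direct ARMA recursion [13.1.21] with zero presample values.
dev = np.zeros(n)
for t in range(n):
    ar = sum(phi[j] * dev[t - 1 - j] for j in range(len(phi)) if t - 1 - j >= 0)
    ma = sum(theta[j] * eps[t - 1 - j] for j in range(len(theta)) if t - 1 - j >= 0)
    dev[t] = ar + eps[t] + ma
y_arma = mu + dev
```

Setting `theta = []` recovers the AR(p) case, in which $F$ is the companion matrix and $H' = [1 \; 0 \; \cdots \; 0]$.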

The state-space form can also be very convenient for modeling sums of stochastic
processes or the consequences of measurement error. For example, Fama
and Gibbons (1982) wanted to study the behavior of the ex ante real interest rate
(the nominal interest rate $i_t$ minus the expected inflation rate $\pi_t^e$). This variable is
unobserved, because the econometrician does not have data on the rate of inflation
anticipated by the bond market. Thus, the state variable for this application was
the scalar $\xi_t = i_t - \pi_t^e - \mu$, where $\mu$ denotes the average ex ante real interest
rate. Fama and Gibbons assumed that the ex ante real rate follows an AR(1)
process:

$$\xi_{t+1} = \phi \xi_t + v_{t+1}. \tag*{[13.1.26]}$$

The econometrician has observations on the ex post real rate (the nominal interest
rate $i_t$ minus actual inflation $\pi_t$), which can be written as

$$i_t - \pi_t = (i_t - \pi_t^e) + (\pi_t^e - \pi_t) = \mu + \xi_t + w_t, \tag*{[13.1.27]}$$

where $w_t \equiv (\pi_t^e - \pi_t)$ is the error that people make in forecasting inflation. If
people form these forecasts optimally, then $w_t$ should be uncorrelated with its own
lagged values or with the ex ante real interest rate. Thus, [13.1.26] and [13.1.27]
are the state equation and observation equation for a state-space model with $r = n = 1$,
$F = \phi$, $y_t = i_t - \pi_t$, $A'x_t = \mu$, $H = 1$, and $w_t = (\pi_t^e - \pi_t)$.

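In another interesting application of the state-space framework, Stock and
Watson (1991) postulated the existence of an unobserved scalar $C_t$ that represents
the state of the business cycle. A set of $n$ different observed macroeconomic variables
$(y_{1t}, y_{2t}, \ldots, y_{nt})$ are each assumed to be influenced by the business cycle
and also to have an idiosyncratic component (denoted $\chi_{it}$) that is unrelated
to movements in $y_{jt}$ for $i \neq j$. If the business cycle and each of the idiosyncratic
components could be described by univariate AR(1) processes, then the
$[(n + 1) \times 1]$ state vector would be

$$\xi_t = \begin{bmatrix} C_t \\ \chi_{1t} \\ \chi_{2t} \\ \vdots \\ \chi_{nt} \end{bmatrix} \tag*{[13.1.28]}$$

with state equation

$$
\begin{bmatrix} C_{t+1} \\ \chi_{1,t+1} \\ \chi_{2,t+1} \\ \vdots \\ \chi_{n,t+1} \end{bmatrix}
= \begin{bmatrix}
\phi_C & 0 & 0 & \cdots & 0 \\
0 & \phi_1 & 0 & \cdots & 0 \\
0 & 0 & \phi_2 & \cdots & 0 \\
\vdots & \vdots & \vdots & \cdots & \vdots \\
0 & 0 & 0 & \cdots & \phi_n
\end{bmatrix}
\begin{bmatrix} C_t \\ \chi_{1t} \\ \chi_{2t} \\ \vdots \\ \chi_{nt} \end{bmatrix}
+ \begin{bmatrix} v_{C,t+1} \\ v_{1,t+1} \\ v_{2,t+1} \\ \vdots \\ v_{n,t+1} \end{bmatrix}
\tag*{[13.1.29]}
$$

and observation equation

$$
\begin{bmatrix} y_{1t} \\ y_{2t} \\ \vdots \\ y_{nt} \end{bmatrix}
= \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_n \end{bmatrix}
+ \begin{bmatrix}
\gamma_1 & 1 & 0 & \cdots & 0 \\
\gamma_2 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \cdots & \vdots \\
\gamma_n & 0 & 0 & \cdots & 1
\end{bmatrix}
\begin{bmatrix} C_t \\ \chi_{1t} \\ \chi_{2t} \\ \vdots \\ \chi_{nt} \end{bmatrix}.
\tag*{[13.1.30]}
$$

Thus, $\gamma_i$ is a parameter that describes the sensitivity of the $i$th series to the business
cycle. To allow for $p$th-order dynamics, Stock and Watson replaced $C_t$ and $\chi_{it}$ in
[13.1.28] with the $(p \times 1)$ vectors $(C_t, C_{t-1}, \ldots, C_{t-p+1})'$ and $(\chi_{it}, \chi_{i,t-1}, \ldots,
\chi_{i,t-p+1})'$ so that $\xi_t$ is an $[(n + 1)p \times 1]$ vector. The scalars $\phi_i$ in [13.1.29] are
then replaced by $(p \times p)$ matrices $F_i$ with the structure of the matrix $F$ in [13.1.14],
and $[n \times (p - 1)]$ blocks of zeros are added between the columns of $H'$ in the
observation equation [13.1.30].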


13.2. Derivation of the Kalman Filter 


Overview of the Kalman Filter 


Consider the general state-space system [13.1.1] through [13.1.7], whose key
equations are reproduced here for convenience:

$$\underset{(r \times 1)}{\xi_{t+1}} = \underset{(r \times r)}{F} \cdot \underset{(r \times 1)}{\xi_t} + \underset{(r \times 1)}{v_{t+1}} \tag*{[13.2.1]}$$

$$\underset{(n \times 1)}{y_t} = \underset{(n \times k)}{A'} \cdot \underset{(k \times 1)}{x_t} + \underset{(n \times r)}{H'} \cdot \underset{(r \times 1)}{\xi_t} + \underset{(n \times 1)}{w_t} \tag*{[13.2.2]}$$

$$E(v_t v_\tau') = \begin{cases} \underset{(r \times r)}{Q} & \text{for } t = \tau \\ 0 & \text{otherwise} \end{cases} \tag*{[13.2.3]}$$

$$E(w_t w_\tau') = \begin{cases} \underset{(n \times n)}{R} & \text{for } t = \tau \\ 0 & \text{otherwise.} \end{cases} \tag*{[13.2.4]}$$

The analyst is presumed to have observed $y_1, y_2, \ldots, y_T, x_1, x_2, \ldots, x_T$.
One of the ultimate objectives may be to estimate the values of any unknown
parameters in the system on the basis of these observations. For now, however,
we will assume that the particular numerical values of $F$, $Q$, $A$, $H$, and $R$ are known
with certainty; Section 13.4 will give details on how these parameters can be estimated
from the data.

There are many uses of the Kalman filter. It is motivated here as an algorithm
for calculating linear least squares forecasts of the state vector on the basis of data
observed through date $t$,

$$\hat{\xi}_{t+1|t} \equiv \hat{E}(\xi_{t+1} | \mathcal{Y}_t),$$

where

$$\mathcal{Y}_t \equiv (y_t', y_{t-1}', \ldots, y_1', x_t', x_{t-1}', \ldots, x_1')' \tag*{[13.2.5]}$$

and $\hat{E}(\xi_{t+1} | \mathcal{Y}_t)$ denotes the linear projection of $\xi_{t+1}$ on $\mathcal{Y}_t$ and a constant. The
Kalman filter calculates these forecasts recursively, generating $\hat{\xi}_{1|0}, \hat{\xi}_{2|1}, \ldots,
\hat{\xi}_{T|T-1}$ in succession. Associated with each of these forecasts is a mean squared
error (MSE) matrix, represented by the following $(r \times r)$ matrix:

$$P_{t+1|t} \equiv E[(\xi_{t+1} - \hat{\xi}_{t+1|t})(\xi_{t+1} - \hat{\xi}_{t+1|t})']. \tag*{[13.2.6]}$$


Starting the Recursion 


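The recursion begins with $\hat{\xi}_{1|0}$, which denotes a forecast of $\xi_1$ based on no
observations of $y$ or $x$. This is just the unconditional mean of $\xi_1$,

$$\hat{\xi}_{1|0} = E(\xi_1),$$

with associated MSE

$$P_{1|0} = E\{[\xi_1 - E(\xi_1)][\xi_1 - E(\xi_1)]'\}.$$

For example, for the state-space representation of the MA(1) system given in
[13.1.17] and [13.1.18], the state vector was

$$\xi_t = \begin{bmatrix} \varepsilon_t \\ \varepsilon_{t-1} \end{bmatrix},$$

for which

$$\hat{\xi}_{1|0} = E\begin{bmatrix} \varepsilon_1 \\ \varepsilon_0 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \tag*{[13.2.7]}$$

$$P_{1|0} = E\left( \begin{bmatrix} \varepsilon_1 \\ \varepsilon_0 \end{bmatrix} \begin{bmatrix} \varepsilon_1 & \varepsilon_0 \end{bmatrix} \right) = \begin{bmatrix} \sigma^2 & 0 \\ 0 & \sigma^2 \end{bmatrix}, \tag*{[13.2.8]}$$

where $\sigma^2 = E(\varepsilon_t^2)$.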


More generally, if eigenvalues of F are all inside the unit circle, then the
process for $\xi_t$ in [13.2.1] is covariance-stationary. The unconditional mean of $\xi_t$
can be found by taking expectations of both sides of [13.2.1], producing

$$E(\xi_{t+1}) = F \cdot E(\xi_t),$$
or, since $\xi_t$ is covariance-stationary,

$$(I_r - F) \cdot E(\xi_t) = 0.$$
Since unity is not an eigenvalue of F, the matrix $(I_r - F)$ is nonsingular, and this
equation has the unique solution $E(\xi_t) = 0$. The unconditional variance of $\xi_t$ can
similarly be found by postmultiplying [13.2.1] by its transpose and taking
expectations:

$$E(\xi_{t+1}\xi_{t+1}') = E[(F\xi_t + v_{t+1})(\xi_t'F' + v_{t+1}')] = F \cdot E(\xi_t\xi_t') \cdot F' + E(v_{t+1}v_{t+1}').$$
Cross-product terms have disappeared in light of [13.1.9]. Letting $\Sigma$ denote the
variance-covariance matrix of $\xi_t$, this equation implies

$$\Sigma = F\Sigma F' + Q,$$
whose solution was seen in [10.2.18] to be given by

$$\operatorname{vec}(\Sigma) = [I_{r^2} - (F \otimes F)]^{-1} \cdot \operatorname{vec}(Q).$$
Thus, in general, provided that the eigenvalues of F are inside the unit circle, the
Kalman filter iterations can be started with $\hat{\xi}_{1|0} = 0$ and $P_{1|0}$ the $(r \times r)$ matrix
whose elements expressed as a column vector are given by

$$\operatorname{vec}(P_{1|0}) = [I_{r^2} - (F \otimes F)]^{-1} \cdot \operatorname{vec}(Q).$$

If instead some eigenvalues of F are on or outside the unit circle, or if the
initial state $\xi_1$ is not regarded as an arbitrary draw from the process implied by
[13.2.1], then $\hat{\xi}_{1|0}$ can be replaced with the analyst's best guess as to the initial
value of $\xi_1$, where $P_{1|0}$ is a positive definite matrix summarizing the confidence in
this guess. Larger values for the diagonal elements of $P_{1|0}$ register greater
uncertainty about the true value of $\xi_1$.
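The starting values for a stationary state equation can be computed directly from the vec formula. The following is a minimal sketch in Python with NumPy; the function name `unconditional_state_moments` and the numerical value $\sigma^2 = 2$ are illustrative assumptions, not part of the text.

```python
import numpy as np

def unconditional_state_moments(F, Q):
    """Return (xi_{1|0}, P_{1|0}) when all eigenvalues of F are inside the unit circle."""
    F = np.atleast_2d(np.asarray(F, dtype=float))
    Q = np.atleast_2d(np.asarray(Q, dtype=float))
    r = F.shape[0]
    if np.max(np.abs(np.linalg.eigvals(F))) >= 1.0:
        raise ValueError("F has an eigenvalue on or outside the unit circle")
    # vec() stacks columns, which corresponds to Fortran ("F") ordering in NumPy.
    vec_Q = Q.reshape(r * r, order="F")
    vec_P = np.linalg.solve(np.eye(r * r) - np.kron(F, F), vec_Q)
    P = vec_P.reshape((r, r), order="F")
    return np.zeros(r), P

# MA(1) state equation, with an assumed innovation variance sigma^2 = 2.
F_ma1 = np.array([[0.0, 0.0], [1.0, 0.0]])
Q_ma1 = np.array([[2.0, 0.0], [0.0, 0.0]])
xi_start, P_start = unconditional_state_moments(F_ma1, Q_ma1)
```

For this example the result reproduces [13.2.7] and [13.2.8]: a zero mean and $P_{1|0} = \sigma^2 I_2$.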


Forecasting $y_t$


Given starting values $\hat{\xi}_{1|0}$ and $P_{1|0}$, the next step is to calculate analogous
magnitudes for the following date, $\hat{\xi}_{2|1}$ and $P_{2|1}$. The calculations for t = 2, 3,
..., T all have the same basic form, so we will describe them in general terms
for step t; given $\hat{\xi}_{t|t-1}$ and $P_{t|t-1}$, the goal is to calculate $\hat{\xi}_{t+1|t}$ and $P_{t+1|t}$.

First note that since we have assumed that $x_t$ contains no information about
$\xi_t$ beyond that contained in $\mathcal{Y}_{t-1}$,

$$\hat{E}(\xi_t \mid x_t, \mathcal{Y}_{t-1}) = \hat{E}(\xi_t \mid \mathcal{Y}_{t-1}) = \hat{\xi}_{t|t-1}.$$
Next consider forecasting the value of $y_t$:
$$\hat{y}_{t|t-1} \equiv \hat{E}(y_t \mid x_t, \mathcal{Y}_{t-1}).$$
Notice from [13.2.2] that
$$\hat{E}(y_t \mid x_t, \xi_t) = A'x_t + H'\xi_t,$$
and so, from the law of iterated projections,

$$\hat{y}_{t|t-1} = A'x_t + H' \cdot \hat{E}(\xi_t \mid x_t, \mathcal{Y}_{t-1}) = A'x_t + H'\hat{\xi}_{t|t-1}. \qquad [13.2.9]$$
From [13.2.2], the error of this forecast is
$$y_t - \hat{y}_{t|t-1} = A'x_t + H'\xi_t + w_t - A'x_t - H'\hat{\xi}_{t|t-1} = H'(\xi_t - \hat{\xi}_{t|t-1}) + w_t$$
with MSE
$$E[(y_t - \hat{y}_{t|t-1})(y_t - \hat{y}_{t|t-1})'] = E[H'(\xi_t - \hat{\xi}_{t|t-1})(\xi_t - \hat{\xi}_{t|t-1})'H] + E[w_t w_t']. \qquad [13.2.10]$$
Cross-product terms have disappeared, since
$$E[w_t(\xi_t - \hat{\xi}_{t|t-1})'] = 0. \qquad [13.2.11]$$
To justify [13.2.11], recall from [13.1.10] that $w_t$ is uncorrelated with $\xi_t$.
Furthermore, since $\hat{\xi}_{t|t-1}$ is a linear function of $\mathcal{Y}_{t-1}$, by [13.1.11] it too must be
uncorrelated with $w_t$.

Using [13.2.4] and [13.2.6], equation [13.2.10] can be written

$$E[(y_t - \hat{y}_{t|t-1})(y_t - \hat{y}_{t|t-1})'] = H'P_{t|t-1}H + R. \qquad [13.2.12]$$


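Updating the Inference About $\xi_t$


Next the inference about the current value of $\xi_t$ is updated on the basis of
the observation of $y_t$ to produce

$$\hat{\xi}_{t|t} = \hat{E}(\xi_t \mid y_t, x_t, \mathcal{Y}_{t-1}) = \hat{E}(\xi_t \mid \mathcal{Y}_t).$$
This can be evaluated using the formula for updating a linear projection, equation
[4.5.30]:¹

$$\hat{\xi}_{t|t} = \hat{\xi}_{t|t-1} + \{E[(\xi_t - \hat{\xi}_{t|t-1})(y_t - \hat{y}_{t|t-1})']\} \times \{E[(y_t - \hat{y}_{t|t-1})(y_t - \hat{y}_{t|t-1})']\}^{-1} \times (y_t - \hat{y}_{t|t-1}). \qquad [13.2.13]$$

¹Here $\xi_t$ corresponds to $Y_3$, $y_t$ corresponds to $Y_2$, and $(x_t', \mathcal{Y}_{t-1}')'$ corresponds to $Y_1$ in equation
[4.5.30].

But
$$\begin{aligned} E\{(\xi_t - \hat{\xi}_{t|t-1})(y_t - \hat{y}_{t|t-1})'\} &= E\{[\xi_t - \hat{\xi}_{t|t-1}][H'(\xi_t - \hat{\xi}_{t|t-1}) + w_t]'\} \\ &= E[(\xi_t - \hat{\xi}_{t|t-1})(\xi_t - \hat{\xi}_{t|t-1})'H] \\ &= P_{t|t-1}H \end{aligned} \qquad [13.2.14]$$
by virtue of [13.2.11] and [13.2.6]. Substituting [13.2.14], [13.2.12], and [13.2.9]
into [13.2.13] gives

$$\hat{\xi}_{t|t} = \hat{\xi}_{t|t-1} + P_{t|t-1}H(H'P_{t|t-1}H + R)^{-1}(y_t - A'x_t - H'\hat{\xi}_{t|t-1}). \qquad [13.2.15]$$

The MSE associated with this updated projection, which is denoted $P_{t|t}$, can
be found from [4.5.31]:

$$\begin{aligned} P_{t|t} &\equiv E[(\xi_t - \hat{\xi}_{t|t})(\xi_t - \hat{\xi}_{t|t})'] \\ &= E[(\xi_t - \hat{\xi}_{t|t-1})(\xi_t - \hat{\xi}_{t|t-1})'] \\ &\quad - \{E[(\xi_t - \hat{\xi}_{t|t-1})(y_t - \hat{y}_{t|t-1})']\} \times \{E[(y_t - \hat{y}_{t|t-1})(y_t - \hat{y}_{t|t-1})']\}^{-1} \\ &\quad \times \{E[(y_t - \hat{y}_{t|t-1})(\xi_t - \hat{\xi}_{t|t-1})']\} \\ &= P_{t|t-1} - P_{t|t-1}H(H'P_{t|t-1}H + R)^{-1}H'P_{t|t-1}. \end{aligned} \qquad [13.2.16]$$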


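Producing a Forecast of $\xi_{t+1}$

Next, the state equation [13.2.1] is used to forecast $\xi_{t+1}$:
$$\begin{aligned} \hat{\xi}_{t+1|t} &= \hat{E}(\xi_{t+1} \mid \mathcal{Y}_t) \\ &= F \cdot \hat{E}(\xi_t \mid \mathcal{Y}_t) + \hat{E}(v_{t+1} \mid \mathcal{Y}_t) \\ &= F\hat{\xi}_{t|t} + 0. \end{aligned} \qquad [13.2.17]$$
Substituting [13.2.15] into [13.2.17],
$$\hat{\xi}_{t+1|t} = F\hat{\xi}_{t|t-1} + FP_{t|t-1}H(H'P_{t|t-1}H + R)^{-1}(y_t - A'x_t - H'\hat{\xi}_{t|t-1}). \qquad [13.2.18]$$

The coefficient matrix in [13.2.18] is known as the gain matrix and is denoted $K_t$:

$$K_t \equiv FP_{t|t-1}H(H'P_{t|t-1}H + R)^{-1}, \qquad [13.2.19]$$
allowing [13.2.18] to be written
$$\hat{\xi}_{t+1|t} = F\hat{\xi}_{t|t-1} + K_t(y_t - A'x_t - H'\hat{\xi}_{t|t-1}). \qquad [13.2.20]$$

The MSE of this forecast can be found from [13.2.17] and the state equation
[13.2.1]:
$$\begin{aligned} P_{t+1|t} &= E[(\xi_{t+1} - \hat{\xi}_{t+1|t})(\xi_{t+1} - \hat{\xi}_{t+1|t})'] \\ &= E[(F\xi_t + v_{t+1} - F\hat{\xi}_{t|t})(F\xi_t + v_{t+1} - F\hat{\xi}_{t|t})'] \\ &= F \cdot E[(\xi_t - \hat{\xi}_{t|t})(\xi_t - \hat{\xi}_{t|t})'] \cdot F' + E[v_{t+1}v_{t+1}'] \\ &= FP_{t|t}F' + Q, \end{aligned} \qquad [13.2.21]$$

with cross-product terms again clearly zero. Substituting [13.2.16] into [13.2.21]
produces

$$P_{t+1|t} = F[P_{t|t-1} - P_{t|t-1}H(H'P_{t|t-1}H + R)^{-1}H'P_{t|t-1}]F' + Q. \qquad [13.2.22]$$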


Summary and Remarks 


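To summarize, the Kalman filter is started with the unconditional mean and
variance of $\xi_1$:

$$\hat{\xi}_{1|0} = E(\xi_1)$$
$$P_{1|0} = E\{[\xi_1 - E(\xi_1)][\xi_1 - E(\xi_1)]'\}.$$
Typically, these are given by $\hat{\xi}_{1|0} = 0$ and $\operatorname{vec}(P_{1|0}) = [I_{r^2} - (F \otimes F)]^{-1} \cdot \operatorname{vec}(Q)$.
We then iterate on
$$\hat{\xi}_{t+1|t} = F\hat{\xi}_{t|t-1} + FP_{t|t-1}H(H'P_{t|t-1}H + R)^{-1}(y_t - A'x_t - H'\hat{\xi}_{t|t-1}) \qquad [13.2.23]$$

and [13.2.22] for t = 1, 2, ..., T. The value $\hat{\xi}_{t+1|t}$ denotes the best forecast of
$\xi_{t+1}$ based on a constant and a linear function of $(y_t, y_{t-1}, \ldots, y_1, x_t, x_{t-1}, \ldots, x_1)$.
The matrix $P_{t+1|t}$ gives the MSE of this forecast. The forecast of $y_{t+1}$ is given
by
$$\hat{y}_{t+1|t} \equiv \hat{E}(y_{t+1} \mid x_{t+1}, \mathcal{Y}_t) = A'x_{t+1} + H'\hat{\xi}_{t+1|t} \qquad [13.2.24]$$
with associated MSE

$$E[(y_{t+1} - \hat{y}_{t+1|t})(y_{t+1} - \hat{y}_{t+1|t})'] = H'P_{t+1|t}H + R. \qquad [13.2.25]$$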


It is worth noting that the recursion in [13.2.22] could be calculated without
ever evaluating [13.2.23]. The values for $P_{t|t-1}$ in [13.2.22] and $K_t$ in [13.2.19] are
not functions of the data, but instead are determined entirely by the population
parameters of the process.
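The recursion summarized above can be sketched in a few lines of Python with NumPy. This is an illustrative implementation under stated assumptions, not a reference one: the function name `kalman_filter`, the array layout (observations stored row by row), and the MA(1) parameter values are all choices made for the example. Running it on two different simulated data sets shows that the state forecasts depend on the data while the MSE matrices do not.

```python
import numpy as np

def kalman_filter(y, x, F, Q, A, H, R, xi0, P0):
    """Iterate [13.2.23] and [13.2.22].

    y: (T, n) observations; x: (T, k) exogenous variables;
    A: (k, n), so A.T is the A' of the text; H: (r, n), so H.T is H'.
    Returns the forecasts xi_{t|t-1} and the MSE matrices P_{t|t-1} for t = 1..T.
    """
    xi, P = np.asarray(xi0, float), np.asarray(P0, float)
    xi_forecasts, mse_matrices = [], []
    for t in range(y.shape[0]):
        xi_forecasts.append(xi)
        mse_matrices.append(P)
        S = H.T @ P @ H + R                       # MSE of the y forecast, [13.2.12]
        S_inv = np.linalg.inv(S)
        K = F @ P @ H @ S_inv                     # gain matrix, [13.2.19]
        err = y[t] - A.T @ x[t] - H.T @ xi        # forecast error for y_t
        xi = F @ xi + K @ err                     # [13.2.20]
        P = F @ (P - P @ H @ S_inv @ H.T @ P) @ F.T + Q   # [13.2.22]
    return np.array(xi_forecasts), np.array(mse_matrices)

# MA(1) system with assumed values theta = 0.5, sigma^2 = 1, mu = 0.
theta, sigma2, mu = 0.5, 1.0, 0.0
F = np.array([[0.0, 0.0], [1.0, 0.0]])
Q = np.array([[sigma2, 0.0], [0.0, 0.0]])
A = np.array([[mu]])
H = np.array([[1.0], [theta]])
R = np.array([[0.0]])
xi0, P0 = np.zeros(2), sigma2 * np.eye(2)

rng = np.random.default_rng(0)
x = np.ones((5, 1))
y_a = rng.normal(size=(5, 1))
y_b = rng.normal(size=(5, 1))
xi_a, P_a = kalman_filter(y_a, x, F, Q, A, H, R, xi0, P0)
xi_b, P_b = kalman_filter(y_b, x, F, Q, A, H, R, xi0, P0)
```

Because the MSE sequence does not depend on `y`, it could be precomputed once and reused for any sample drawn from the same system.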

An alternative way of writing the recursion for $P_{t+1|t}$ is sometimes useful.
Subtracting the Kalman updating equation [13.2.20] from the state equation [13.2.1]
produces

$$\xi_{t+1} - \hat{\xi}_{t+1|t} = F(\xi_t - \hat{\xi}_{t|t-1}) - K_t(y_t - A'x_t - H'\hat{\xi}_{t|t-1}) + v_{t+1}. \qquad [13.2.26]$$
Further substituting the observation equation [13.2.2] into [13.2.26] results in

$$\xi_{t+1} - \hat{\xi}_{t+1|t} = (F - K_tH')(\xi_t - \hat{\xi}_{t|t-1}) - K_tw_t + v_{t+1}. \qquad [13.2.27]$$
Postmultiplying [13.2.27] by its transpose and taking expectations,
$$E[(\xi_{t+1} - \hat{\xi}_{t+1|t})(\xi_{t+1} - \hat{\xi}_{t+1|t})'] = (F - K_tH')E[(\xi_t - \hat{\xi}_{t|t-1})(\xi_t - \hat{\xi}_{t|t-1})'](F' - HK_t') + K_tRK_t' + Q;$$

or, recalling the definition of $P_{t+1|t}$ in equation [13.2.6],

$$P_{t+1|t} = (F - K_tH')P_{t|t-1}(F' - HK_t') + K_tRK_t' + Q. \qquad [13.2.28]$$

Equation [13.2.28] along with the definition of $K_t$ in [13.2.19] will produce the
same sequence generated by equation [13.2.22].
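The equivalence of the two forms of the MSE recursion can be checked numerically. The sketch below uses Python with NumPy; the function names and the particular matrices are hypothetical values chosen only for the check.

```python
import numpy as np

def mse_update_riccati(P, F, Q, H, R):
    """One step of [13.2.22]."""
    S = H.T @ P @ H + R
    return F @ (P - P @ H @ np.linalg.solve(S, H.T @ P)) @ F.T + Q

def mse_update_gain_form(P, F, Q, H, R):
    """One step of [13.2.28], with K_t taken from [13.2.19]."""
    S = H.T @ P @ H + R
    K = F @ P @ H @ np.linalg.inv(S)
    G = F - K @ H.T
    return G @ P @ G.T + K @ R @ K.T + Q

# Hypothetical system with r = 2 states and n = 1 observable.
F = np.array([[0.8, 0.1], [0.0, 0.5]])
Q = 0.5 * np.eye(2)
H = np.array([[1.0], [0.3]])
R = np.array([[0.4]])
P = np.array([[1.0, 0.2], [0.2, 2.0]])   # any symmetric positive definite P_{t|t-1}

P_next_a = mse_update_riccati(P, F, Q, H, R)
P_next_b = mse_update_gain_form(P, F, Q, H, R)
```

The form [13.2.28] writes the update as a sum of symmetric positive semidefinite terms, which can be convenient when rounding error is a concern; that numerical remark is a general observation rather than a claim made in the text.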


13.3. Forecasts Based on the State-Space Representation

The Kalman filter computations in [13.2.22] through [13.2.25] are normally
calculated by computer, using the known numerical values of F, Q, A, H, and R along
with the actual data. To help make the ideas more concrete, however, we now
explore analytically the outcome of these calculations for a simple example.


Example— Using the Kalman Filter to Find Exact 
Finite-Sample Forecasts for an MA(1) Process 


Consider again a state-space representation for the MA(1) process: 


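State Equation (r = 2):
$$\begin{bmatrix} \varepsilon_{t+1} \\ \varepsilon_t \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ 1 & 0 \end{bmatrix}\begin{bmatrix} \varepsilon_t \\ \varepsilon_{t-1} \end{bmatrix} + \begin{bmatrix} \varepsilon_{t+1} \\ 0 \end{bmatrix} \qquad [13.3.1]$$

Observation Equation (n = 1):

$$y_t = \mu + \begin{bmatrix} 1 & \theta \end{bmatrix}\begin{bmatrix} \varepsilon_t \\ \varepsilon_{t-1} \end{bmatrix}; \qquad [13.3.2]$$
that is,
$$\xi_t = \begin{bmatrix} \varepsilon_t \\ \varepsilon_{t-1} \end{bmatrix} \qquad [13.3.3]$$
$$F = \begin{bmatrix} 0 & 0 \\ 1 & 0 \end{bmatrix} \qquad [13.3.4]$$
$$v_{t+1} = \begin{bmatrix} \varepsilon_{t+1} \\ 0 \end{bmatrix} \qquad [13.3.5]$$
$$Q = \begin{bmatrix} \sigma^2 & 0 \\ 0 & 0 \end{bmatrix} \qquad [13.3.6]$$
$$y_t = y_t \qquad [13.3.7]$$
$$A' = \mu \qquad [13.3.8]$$
$$x_t = 1 \qquad [13.3.9]$$
$$H' = \begin{bmatrix} 1 & \theta \end{bmatrix} \qquad [13.3.10]$$
$$w_t = 0 \qquad [13.3.11]$$
$$R = 0. \qquad [13.3.12]$$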


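The starting values for the filter were described in [13.2.7] and [13.2.8]:

$$\hat{\xi}_{1|0} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$
$$P_{1|0} = \begin{bmatrix} \sigma^2 & 0 \\ 0 & \sigma^2 \end{bmatrix}.$$

Thus, from [13.2.24], the period 1 forecast is
$$\hat{y}_{1|0} = \mu + H'\hat{\xi}_{1|0} = \mu,$$

with MSE given by [13.2.25]:

$$E(y_1 - \hat{y}_{1|0})^2 = H'P_{1|0}H + R = \begin{bmatrix} 1 & \theta \end{bmatrix}\begin{bmatrix} \sigma^2 & 0 \\ 0 & \sigma^2 \end{bmatrix}\begin{bmatrix} 1 \\ \theta \end{bmatrix} + 0 = \sigma^2(1 + \theta^2).$$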


These, of course, are just the unconditional mean and variance of y. 

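To see the structure of the recursion for t = 2, 3, ..., T, consider the basic
form of the updating equation [13.2.23]. Notice that since the first row of F consists
entirely of zeros, the first element of the vector $\hat{\xi}_{t+1|t}$ will always equal zero, for
all t. We see why if we recall the meaning of the state vector in [13.3.3]:

$$\hat{\xi}_{t+1|t} = \begin{bmatrix} \hat{\varepsilon}_{t+1|t} \\ \hat{\varepsilon}_{t|t} \end{bmatrix}. \qquad [13.3.13]$$

Naturally, the forecast of the future white noise, $\hat{\varepsilon}_{t+1|t}$, is always zero. The forecast
of $y_{t+1}$ is given by [13.2.24]:

$$\hat{y}_{t+1|t} = \mu + \begin{bmatrix} 1 & \theta \end{bmatrix}\begin{bmatrix} \hat{\varepsilon}_{t+1|t} \\ \hat{\varepsilon}_{t|t} \end{bmatrix} = \mu + \theta\hat{\varepsilon}_{t|t}. \qquad [13.3.14]$$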


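The Kalman filter updating equation for the MSE, equation [13.2.21], for
this example becomes

$$P_{t+1|t} = FP_{t|t}F' + Q = \begin{bmatrix} 0 & 0 \\ 1 & 0 \end{bmatrix}P_{t|t}\begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix} + \begin{bmatrix} \sigma^2 & 0 \\ 0 & 0 \end{bmatrix}. \qquad [13.3.15]$$

Thus, $P_{t+1|t}$ is a diagonal matrix of the form
$$P_{t+1|t} = \begin{bmatrix} \sigma^2 & 0 \\ 0 & p_{t+1} \end{bmatrix}, \qquad [13.3.16]$$

where the (2, 2) element of $P_{t+1|t}$ (which we have denoted by $p_{t+1}$) is the same as
the (1, 1) element of $P_{t|t}$. Recalling [13.2.6] and [13.3.13], this term has the
interpretation as the MSE of $\hat{\varepsilon}_{t|t}$:

$$p_{t+1} = E(\varepsilon_t - \hat{\varepsilon}_{t|t})^2. \qquad [13.3.17]$$
The (1, 1) element of $P_{t+1|t}$ has the interpretation as the MSE of $\hat{\varepsilon}_{t+1|t}$. We have
seen that this forecast is always zero, and its MSE in [13.3.16] is $\sigma^2$ for all t. The
fact that $P_{t+1|t}$ is a diagonal matrix means that the forecast error $(\varepsilon_{t+1} - \hat{\varepsilon}_{t+1|t})$ is
uncorrelated with $(\varepsilon_t - \hat{\varepsilon}_{t|t})$.

The MSE of the forecast of $y_{t+1}$ is given by [13.2.25]:

$$\begin{aligned} E(y_{t+1} - \hat{y}_{t+1|t})^2 &= H'P_{t+1|t}H + R \\ &= \begin{bmatrix} 1 & \theta \end{bmatrix}\begin{bmatrix} \sigma^2 & 0 \\ 0 & p_{t+1} \end{bmatrix}\begin{bmatrix} 1 \\ \theta \end{bmatrix} + 0 \\ &= \sigma^2 + \theta^2 p_{t+1}. \end{aligned} \qquad [13.3.18]$$
Again, the intuition can be seen from the nature of the forecast in [13.3.14]:
$$E(y_{t+1} - \hat{y}_{t+1|t})^2 = E[(\mu + \varepsilon_{t+1} + \theta\varepsilon_t) - (\mu + \theta\hat{\varepsilon}_{t|t})]^2 = E(\varepsilon_{t+1}^2) + \theta^2 E(\varepsilon_t - \hat{\varepsilon}_{t|t})^2,$$
which, from [13.3.17], reproduces [13.3.18].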
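From [13.2.23], the series for $\hat{\varepsilon}_{t|t}$ is generated recursively from

$$\begin{bmatrix} 0 \\ \hat{\varepsilon}_{t|t} \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ 1 & 0 \end{bmatrix}\begin{bmatrix} 0 \\ \hat{\varepsilon}_{t-1|t-1} \end{bmatrix} + \begin{bmatrix} 0 & 0 \\ 1 & 0 \end{bmatrix}\begin{bmatrix} \sigma^2 & 0 \\ 0 & p_t \end{bmatrix}\begin{bmatrix} 1 \\ \theta \end{bmatrix}\{1/[\sigma^2 + \theta^2 p_t]\} \cdot \{y_t - \mu - \theta\hat{\varepsilon}_{t-1|t-1}\}$$
or
$$\hat{\varepsilon}_{t|t} = \{\sigma^2/[\sigma^2 + \theta^2 p_t]\} \cdot \{y_t - \mu - \theta\hat{\varepsilon}_{t-1|t-1}\} \qquad [13.3.19]$$

starting from the initial value $\hat{\varepsilon}_{0|0} = 0$. Note that the value for $\hat{\varepsilon}_{t|t}$ differs from the
approximation suggested in equations [4.2.36] and [4.3.2],

$$\hat{\varepsilon}_t = y_t - \mu - \theta\hat{\varepsilon}_{t-1}, \qquad \hat{\varepsilon}_0 = 0,$$
in that [13.3.19] shrinks the inference $\hat{\varepsilon}_{t|t}$ toward zero to take account of the nonzero
variance $p_t$ of $\hat{\varepsilon}_{t-1|t-1}$ around the true value $\varepsilon_{t-1}$.

The gain matrix $K_t$ in equation [13.2.19] is given by

$$K_t = \begin{bmatrix} 0 & 0 \\ 1 & 0 \end{bmatrix}\begin{bmatrix} \sigma^2 & 0 \\ 0 & p_t \end{bmatrix}\begin{bmatrix} 1 \\ \theta \end{bmatrix}\left(\frac{1}{\sigma^2 + \theta^2 p_t}\right) = \begin{bmatrix} 0 \\ \sigma^2/[\sigma^2 + \theta^2 p_t] \end{bmatrix}. \qquad [13.3.20]$$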


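Finally, notice from [13.2.16] that

$$P_{t|t} = \begin{bmatrix} \sigma^2 & 0 \\ 0 & p_t \end{bmatrix} - \left(\frac{1}{\sigma^2 + \theta^2 p_t}\right)\begin{bmatrix} \sigma^2 & 0 \\ 0 & p_t \end{bmatrix}\begin{bmatrix} 1 \\ \theta \end{bmatrix}\begin{bmatrix} 1 & \theta \end{bmatrix}\begin{bmatrix} \sigma^2 & 0 \\ 0 & p_t \end{bmatrix}.$$

The (1, 1) element of $P_{t|t}$ (which we saw equals $p_{t+1}$) is thus given by

$$p_{t+1} = \sigma^2 - \{1/[\sigma^2 + \theta^2 p_t]\} \cdot \sigma^4 = \frac{\sigma^2\theta^2 p_t}{\sigma^2 + \theta^2 p_t}. \qquad [13.3.21]$$
The recursion in [13.3.21] is started with $p_1 = \sigma^2$ and thus has the solution
$$p_{t+1} = \frac{\sigma^2\theta^{2t}}{1 + \theta^2 + \theta^4 + \cdots + \theta^{2t}}. \qquad [13.3.22]$$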


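It is interesting to note what happens to the filter as t becomes large. First
consider the case when $|\theta| < 1$. Then, from [13.3.22],

$$\lim_{t\to\infty} p_{t+1} = 0,$$
and so, from [13.3.17],

$$\hat{\varepsilon}_{t|t} \xrightarrow{\text{m.s.}} \varepsilon_t.$$
Thus, given a sufficient number of observations on y, the Kalman filter inference
$\hat{\varepsilon}_{t|t}$ converges to the true value $\varepsilon_t$, and the forecast [13.3.14] converges to that of
the Wold representation for the process. The Kalman gain in [13.3.20] converges
to $(0, 1)'$.

Alternatively, consider the case when $|\theta| > 1$. From [13.3.22], we have
$$p_{t+1} = \frac{\sigma^2\theta^{2t}(1 - \theta^2)}{1 - \theta^{2(t+1)}} = \frac{\sigma^2(1 - \theta^2)}{\theta^{-2t} - \theta^2}$$
and
$$\lim_{t\to\infty} p_{t+1} = \frac{\sigma^2(1 - \theta^2)}{-\theta^2} > 0.$$

No matter how many observations are obtained, it will not be possible to know
with certainty the value of the nonfundamental innovation $\varepsilon_t$ associated with date
t on the basis of $(y_t, y_{t-1}, \ldots, y_1)$. The gain is given by
$$\frac{\sigma^2}{\sigma^2 + \theta^2 p_t} \to \frac{\sigma^2}{\sigma^2 - \sigma^2(1 - \theta^2)} = \frac{1}{\theta^2},$$
and the recursion [13.3.19] approaches
$$\hat{\varepsilon}_{t|t} = (1/\theta^2) \cdot (y_t - \mu - \theta\hat{\varepsilon}_{t-1|t-1})$$
or
$$\theta\hat{\varepsilon}_{t|t} = (1/\theta) \cdot (y_t - \mu - \theta\hat{\varepsilon}_{t-1|t-1}).$$
Recalling [13.3.14], we thus have
$$\hat{y}_{t+1|t} - \mu = (1/\theta) \cdot [(y_t - \mu) - (\hat{y}_{t|t-1} - \mu)]$$
or
$$\hat{y}_{t+1|t} - \mu = (1/\theta) \cdot (y_t - \mu) - (1/\theta)^2 \cdot (y_{t-1} - \mu) + (1/\theta)^3 \cdot (y_{t-2} - \mu) - \cdots,$$
which again is the AR(∞) forecast associated with the invertible MA(1)
representation. Indeed, the forecasts of the Kalman filter with $\theta$ replaced by $\theta^{-1}$ and $\sigma^2$
replaced by $\theta^2\sigma^2$ will be identical for any t; see Exercise 13.5.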
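The behavior of $p_t$ in the two cases can be illustrated with a short pure-Python sketch. The function names and the numerical values of $\theta$ and $\sigma^2$ are assumptions made for the illustration; the formulas are the recursion [13.3.21] and the closed form [13.3.22].

```python
def ma1_state_mse(theta, sigma2, T):
    """Iterate [13.3.21] from p_1 = sigma2 and return [p_1, ..., p_T]."""
    p = [sigma2]
    for _ in range(T - 1):
        p.append(sigma2 * theta**2 * p[-1] / (sigma2 + theta**2 * p[-1]))
    return p

def ma1_state_mse_closed_form(theta, sigma2, t):
    """Closed-form value of p_{t+1} from [13.3.22]."""
    return sigma2 * theta**(2 * t) / sum(theta**(2 * j) for j in range(t + 1))

# Invertible case: p_t shrinks toward zero.
p_inv = ma1_state_mse(theta=0.5, sigma2=1.0, T=60)

# Noninvertible case: p_t approaches sigma2 * (theta^2 - 1) / theta^2.
p_non = ma1_state_mse(theta=2.0, sigma2=1.0, T=60)
limit_non = 1.0 * (2.0**2 - 1.0) / 2.0**2

# Second element of the limiting gain in [13.3.20] for the noninvertible case.
gain_non = 1.0 / (1.0 + 2.0**2 * p_non[-1])
```

With $\theta = 2$ the limiting gain element equals $1/\theta^2 = 0.25$, in line with the limit derived above.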


Calculating s-Period-Ahead Forecasts 
with the Kalman Filter 


The forecast of $y_t$ calculated in [13.2.24] is an exact finite-sample forecast of
$y_t$ on the basis of $x_t$ and $\mathcal{Y}_{t-1} \equiv (y_{t-1}', y_{t-2}', \ldots, y_1', x_{t-1}', x_{t-2}', \ldots, x_1')'$. If $x_t$
is deterministic, it is also easy to use the Kalman filter to calculate exact
finite-sample s-period-ahead forecasts.

The state equation [13.2.1] can be solved by recursive substitution to yield

$$\xi_{t+s} = F^s\xi_t + F^{s-1}v_{t+1} + F^{s-2}v_{t+2} + \cdots + F^1 v_{t+s-1} + v_{t+s} \quad \text{for } s = 1, 2, \ldots. \qquad [13.3.23]$$
The projection of $\xi_{t+s}$ on $\xi_t$ and $\mathcal{Y}_t$ is given by
$$\hat{E}(\xi_{t+s} \mid \xi_t, \mathcal{Y}_t) = F^s\xi_t. \qquad [13.3.24]$$
From the law of iterated projections,
$$\hat{\xi}_{t+s|t} \equiv \hat{E}(\xi_{t+s} \mid \mathcal{Y}_t) = F^s\hat{\xi}_{t|t}. \qquad [13.3.25]$$

Thus, from [13.3.23] the s-period-ahead forecast error for the state vector is
$$\xi_{t+s} - \hat{\xi}_{t+s|t} = F^s(\xi_t - \hat{\xi}_{t|t}) + F^{s-1}v_{t+1} + F^{s-2}v_{t+2} + \cdots + F^1 v_{t+s-1} + v_{t+s} \qquad [13.3.26]$$
with MSE
$$P_{t+s|t} = F^sP_{t|t}(F')^s + F^{s-1}Q(F')^{s-1} + F^{s-2}Q(F')^{s-2} + \cdots + FQF' + Q. \qquad [13.3.27]$$
To forecast the observed vector $y_{t+s}$, recall from the observation equation
that
$$y_{t+s} = A'x_{t+s} + H'\xi_{t+s} + w_{t+s}. \qquad [13.3.28]$$
There are advantages if the state vector is defined in such a way that $x_t$ is
deterministic, so that the dynamics of any exogenous variables can be represented
through $\xi_t$. If $x_t$ is deterministic, the s-period-ahead forecast of y is
$$\hat{y}_{t+s|t} \equiv \hat{E}(y_{t+s} \mid \mathcal{Y}_t) = A'x_{t+s} + H'\hat{\xi}_{t+s|t}. \qquad [13.3.29]$$
The forecast error is
$$y_{t+s} - \hat{y}_{t+s|t} = (A'x_{t+s} + H'\xi_{t+s} + w_{t+s}) - (A'x_{t+s} + H'\hat{\xi}_{t+s|t}) = H'(\xi_{t+s} - \hat{\xi}_{t+s|t}) + w_{t+s}$$
with MSE
$$E[(y_{t+s} - \hat{y}_{t+s|t})(y_{t+s} - \hat{y}_{t+s|t})'] = H'P_{t+s|t}H + R. \qquad [13.3.30]$$
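The s-period-ahead formulas can be sketched in Python with NumPy as follows. The function name `s_step_forecast` and all numerical values are illustrative assumptions; the sum in [13.3.27] is accumulated by applying $P \mapsto FPF' + Q$ s times, which yields the same matrix.

```python
import numpy as np

def s_step_forecast(xi_tt, P_tt, F, Q, A, H, R, x_future, s):
    """Return the forecast [13.3.29] and its MSE [13.3.30] for horizon s >= 1."""
    xi, P = np.asarray(xi_tt, float), np.asarray(P_tt, float)
    for _ in range(s):
        xi = F @ xi                # builds F^s xi_{t|t}, [13.3.25]
        P = F @ P @ F.T + Q        # builds the sum in [13.3.27] term by term
    y_hat = A.T @ x_future + H.T @ xi
    mse = H.T @ P @ H + R
    return y_hat, mse

# MA(1) system with assumed values theta = 0.5, sigma^2 = 1, mu = 3.
theta, sigma2, mu = 0.5, 1.0, 3.0
F = np.array([[0.0, 0.0], [1.0, 0.0]])
Q = np.array([[sigma2, 0.0], [0.0, 0.0]])
A = np.array([[mu]])
H = np.array([[1.0], [theta]])
R = np.array([[0.0]])
xi_tt = np.array([0.7, -0.2])                   # hypothetical filtered state
P_tt = np.array([[0.2, 0.05], [0.05, 0.3]])     # hypothetical filtered MSE
x_future = np.array([1.0])

y1, mse1 = s_step_forecast(xi_tt, P_tt, F, Q, A, H, R, x_future, s=1)
y2, mse2 = s_step_forecast(xi_tt, P_tt, F, Q, A, H, R, x_future, s=2)
```

For an MA(1) process the two-step forecast collapses to the unconditional mean with MSE $\sigma^2(1 + \theta^2)$, which is what the example returns.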


13.4. Maximum Likelihood Estimation of Parameters 


Using the Kalman Filter to Evaluate 
the Likelihood Function 


The Kalman filter was motivated in Section 13.2 in terms of linear projections.
The forecasts $\hat{\xi}_{t|t-1}$ and $\hat{y}_{t|t-1}$ are thus optimal within the set of forecasts that are
linear in $(x_t, \mathcal{Y}_{t-1})$, where $\mathcal{Y}_{t-1} \equiv (y_{t-1}', y_{t-2}', \ldots, y_1', x_{t-1}', x_{t-2}', \ldots, x_1')'$. If
the initial state $\xi_1$ and the innovations $\{w_t, v_t\}_{t=1}^T$ are multivariate Gaussian, then
we can make the stronger claim that the forecasts $\hat{\xi}_{t|t-1}$ and $\hat{y}_{t|t-1}$ calculated by
the Kalman filter are optimal among any functions of $(x_t, \mathcal{Y}_{t-1})$. Moreover, if $\xi_1$
and $\{w_t, v_t\}_{t=1}^T$ are Gaussian, then the distribution of $y_t$ conditional on $(x_t, \mathcal{Y}_{t-1})$
is Gaussian with mean given by [13.2.24] and variance given by [13.2.25]:
$$y_t \mid x_t, \mathcal{Y}_{t-1} \sim N((A'x_t + H'\hat{\xi}_{t|t-1}), (H'P_{t|t-1}H + R));$$
that is,
$$\begin{aligned} f_{Y_t|X_t,\mathcal{Y}_{t-1}}(y_t \mid x_t, \mathcal{Y}_{t-1}) &= (2\pi)^{-n/2}|H'P_{t|t-1}H + R|^{-1/2} \\ &\quad \times \exp\{-\tfrac{1}{2}(y_t - A'x_t - H'\hat{\xi}_{t|t-1})'(H'P_{t|t-1}H + R)^{-1} \\ &\quad \times (y_t - A'x_t - H'\hat{\xi}_{t|t-1})\} \quad \text{for } t = 1, 2, \ldots, T. \end{aligned} \qquad [13.4.1]$$

From [13.4.1], it is a simple matter to construct the sample log likelihood,

$$\sum_{t=1}^{T} \log f_{Y_t|X_t,\mathcal{Y}_{t-1}}(y_t \mid x_t, \mathcal{Y}_{t-1}). \qquad [13.4.2]$$


Expression [13.4.2] can then be maximized numerically with respect to the unknown 
parameters in the matrices F, Q, A, H, and R; see Burmeister and Wall (1982) for 
an illustrative application. 

As stressed by Harvey and Phillips (1979), this representation of the likelihood 
is particularly convenient for estimating regressions involving moving average terms. 
Moreover, [13.4.2] gives the exact log likelihood function, regardless of whether 
the moving average representation is invertible. 

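As an illustrative example, suppose we wanted to estimate a bivariate
regression model whose equations were

$$y_{1t} = a_1'x_t + u_{1t}$$
$$y_{2t} = a_2'x_t + u_{2t},$$
where $x_t$ is a $(k \times 1)$ vector of exogenous explanatory variables and $a_1$ and $a_2$ are
$(k \times 1)$ vectors of coefficients; if the two regressions have different explanatory
variables, the variables from both regressions are included in $x_t$ with zeros
appropriately imposed on $a_1$ and $a_2$. Suppose that the disturbance vector follows a
bivariate MA(1) process:

$$\begin{bmatrix} u_{1t} \\ u_{2t} \end{bmatrix} = \begin{bmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{bmatrix} + \begin{bmatrix} \theta_{11} & \theta_{12} \\ \theta_{21} & \theta_{22} \end{bmatrix}\begin{bmatrix} \varepsilon_{1,t-1} \\ \varepsilon_{2,t-1} \end{bmatrix},$$

with $(\varepsilon_{1t}, \varepsilon_{2t})' \sim$ i.i.d. $N(0, \Omega)$. This model can be written in state-space form by
defining

$$\xi_t = \begin{bmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \\ \varepsilon_{1,t-1} \\ \varepsilon_{2,t-1} \end{bmatrix} \qquad F = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} \qquad v_{t+1} = \begin{bmatrix} \varepsilon_{1,t+1} \\ \varepsilon_{2,t+1} \\ 0 \\ 0 \end{bmatrix}$$
$$Q = \begin{bmatrix} \sigma_{11} & \sigma_{12} & 0 & 0 \\ \sigma_{21} & \sigma_{22} & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} \qquad A' = \begin{bmatrix} a_1' \\ a_2' \end{bmatrix}$$
$$H' = \begin{bmatrix} 1 & 0 & \theta_{11} & \theta_{12} \\ 0 & 1 & \theta_{21} & \theta_{22} \end{bmatrix} \qquad R = 0,$$
where $\sigma_{ij} = E(\varepsilon_{it}\varepsilon_{jt})$. The Kalman filter iteration is started from
$$\hat{\xi}_{1|0} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix} \qquad P_{1|0} = \begin{bmatrix} \sigma_{11} & \sigma_{12} & 0 & 0 \\ \sigma_{21} & \sigma_{22} & 0 & 0 \\ 0 & 0 & \sigma_{11} & \sigma_{12} \\ 0 & 0 & \sigma_{21} & \sigma_{22} \end{bmatrix}.$$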


Maximization of [13.4.2] is started by making an initial guess as to the
numerical values of the unknown parameters. One obvious way to do this is to regress
$y_{1t}$ on the elements of $x_t$ that appear in the first equation to get an initial guess for
$a_1$. A similar OLS regression for $y_{2t}$ yields a guess for $a_2$. Setting $\theta_{11} = \theta_{12} =
\theta_{21} = \theta_{22} = 0$ initially, a first guess for $\Omega$ could be the estimated variance-covariance
matrix of the residuals from these two OLS regressions. For these initial numerical
values for the population parameters, we could construct F, Q, A, H, and R from
the expressions just given and iterate on [13.2.22] through [13.2.25] for t = 1, 2,
..., T − 1. The sequences $\{\hat{\xi}_{t|t-1}\}_{t=1}^T$ and $\{P_{t|t-1}\}_{t=1}^T$ resulting from these
iterations could then be used in [13.4.1] and [13.4.2] to calculate the value for the log
likelihood function that results from these initial parameter values. The numerical
optimization methods described in Section 5.7 can then be employed to make better
guesses as to the value of the unknown parameters until [13.4.2] is maximized. As
noted in Section 5.9, the numerical search will be better behaved if $\Omega$ is
parameterized in terms of its Cholesky factorization.

As a second example, consider a scalar Gaussian ARMA(1, 1) process, 


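$$y_t - \mu = \phi(y_{t-1} - \mu) + \varepsilon_t + \theta\varepsilon_{t-1},$$

with $\varepsilon_t \sim$ i.i.d. $N(0, \sigma^2)$. This can be written in state-space form as in [13.1.22]
and [13.1.23] with r = 2 and

$$F = \begin{bmatrix} \phi & 0 \\ 1 & 0 \end{bmatrix} \qquad v_{t+1} = \begin{bmatrix} \varepsilon_{t+1} \\ 0 \end{bmatrix} \qquad Q = \begin{bmatrix} \sigma^2 & 0 \\ 0 & 0 \end{bmatrix}$$
$$A' = \mu \qquad x_t = 1 \qquad H' = \begin{bmatrix} 1 & \theta \end{bmatrix} \qquad R = 0$$
$$\hat{\xi}_{1|0} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \qquad P_{1|0} = \begin{bmatrix} \sigma^2/(1 - \phi^2) & \phi\sigma^2/(1 - \phi^2) \\ \phi\sigma^2/(1 - \phi^2) & \sigma^2/(1 - \phi^2) \end{bmatrix}.$$

This value for $P_{1|0}$ was obtained by recognizing that the state equation [13.1.22]
describes the behavior of $\xi_t = (z_t, z_{t-1}, \ldots, z_{t-r+1})'$, where $z_t = \phi_1 z_{t-1} +
\phi_2 z_{t-2} + \cdots + \phi_r z_{t-r} + \varepsilon_t$ follows an AR(r) process. For this example, r = 2,
so that $P_{1|0}$ is the variance-covariance matrix of two consecutive draws from an
AR(2) process with parameters $\phi_1 = \phi$ and $\phi_2 = 0$. The expressions just given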
for F, Q, A, H, and R are then used in the Kalman filter iterations. Thus, expression 
[13.4.2] allows easy computation of the exact likelihood function for an ARMA(p, q) 
process. This computation is valid regardless of whether the moving average pa- 
rameters satisfy the invertibility condition. Similarly, expression [13.3.29] gives the 
exact finite-sample s-period-ahead forecast for the process and [13.3.30] its MSE, 
again regardless of whether the invertible representation is used. 

Typically, numerical search procedures for maximizing [13.4.2] require the 
derivatives of the log likelihood. These can be calculated numerically or analytically. 
To characterize the analytical derivatives of [13.4.2], collect the unknown
parameters to be estimated in a vector $\theta$, and write $F(\theta)$, $Q(\theta)$, $A(\theta)$, $H(\theta)$, and $R(\theta)$.
Implicitly, then, $\hat{\xi}_{t|t-1}(\theta)$ and $P_{t|t-1}(\theta)$ will be functions of $\theta$ as well, and the
derivative of the log of [13.4.1] with respect to the ith element of $\theta$ will involve
$\partial\hat{\xi}_{t|t-1}(\theta)/\partial\theta_i$ and $\partial P_{t|t-1}(\theta)/\partial\theta_i$. These derivatives can also be generated
recursively by differentiating the Kalman filter recursion, [13.2.22] and [13.2.23], with
respect to $\theta_i$; see Caines (1988, pp. 585-86) for illustration.

For many state-space models, the EM algorithm of Dempster, Laird, and 
Rubin (1977) offers a particularly convenient means for maximizing [13.4.2], as 
developed by Shumway and Stoffer (1982) and Watson and Engle (1983). 


Identification 


Although the state-space representation gives a very convenient way to cal- 
culate the exact likelihood function, a word of caution should be given. In the 
absence of restrictions on F, Q, A, H, and R, the parameters of the state-space 
representation are unidentified—more than one set of values for the parameters 
can give rise to the identical value of the likelihood function, and the data give us 
no guide for choosing among these. A trivial example is the following system: 


State Equation (r = 2):

$$\xi_{t+1} = \begin{bmatrix} \varepsilon_{1,t+1} \\ \varepsilon_{2,t+1} \end{bmatrix} \qquad [13.4.3]$$
Observation Equation (n = 1):
$$y_t = \varepsilon_{1t} + \varepsilon_{2t}. \qquad [13.4.4]$$

Here, $F = 0$, $Q = \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix}$, $A' = 0$, $H' = [1 \;\; 1]$, and $R = 0$. This model asserts
that $y_t$ is white noise, with mean zero and variance given by $(\sigma_1^2 + \sigma_2^2)$. The reader
is invited to confirm in Exercise 13.4 that the log of the likelihood function from
[13.4.1] and [13.4.2] simplifies to

$$\log f_{Y_T,Y_{T-1},\ldots,Y_1}(y_T, y_{T-1}, \ldots, y_1) = -(T/2)\log(2\pi) - (T/2)\log(\sigma_1^2 + \sigma_2^2) - \sum_{t=1}^{T} y_t^2/[2(\sigma_1^2 + \sigma_2^2)]. \qquad [13.4.5]$$

Clearly, any values for $\sigma_1^2$ and $\sigma_2^2$ that sum to a given constant will produce the
identical value for the likelihood function.
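The lack of identification in this trivial system can be seen numerically by evaluating [13.4.2] with the Kalman filter. The following Python/NumPy sketch is illustrative only: the function names, the array layout, and the data values are assumptions, and the filter is the plain recursion of Section 13.2 with no numerical safeguards.

```python
import numpy as np

def kalman_loglik(y, x, F, Q, A, H, R, xi0, P0):
    """Sum the Gaussian conditional log densities [13.4.1] to obtain [13.4.2]."""
    n = y.shape[1]
    xi, P, loglik = np.asarray(xi0, float), np.asarray(P0, float), 0.0
    for t in range(y.shape[0]):
        S = H.T @ P @ H + R                     # conditional variance of y_t
        S_inv = np.linalg.inv(S)
        err = y[t] - A.T @ x[t] - H.T @ xi      # y_t minus its conditional mean
        _, logdet = np.linalg.slogdet(S)
        loglik += -0.5 * (n * np.log(2 * np.pi) + logdet + err @ S_inv @ err)
        xi = F @ xi + F @ P @ H @ S_inv @ err   # [13.2.23]
        P = F @ (P - P @ H @ S_inv @ H.T @ P) @ F.T + Q   # [13.2.22]
    return float(loglik)

def trivial_example_loglik(y, s1, s2):
    """Log likelihood of the system [13.4.3]-[13.4.4] with variances s1 and s2."""
    F = np.zeros((2, 2))
    Q = np.diag([s1, s2])
    H = np.ones((2, 1))
    R = np.zeros((1, 1))
    A = np.zeros((1, 1))
    x = np.ones((len(y), 1))
    # With F = 0 the unconditional variance of the state is Q itself.
    return kalman_loglik(y, x, F, Q, A, H, R, np.zeros(2), Q)

y = np.array([[0.4], [-1.1], [0.7], [2.0]])    # made-up observations
ll_1 = trivial_example_loglik(y, 1.0, 2.0)     # variances sum to 3
ll_2 = trivial_example_loglik(y, 2.5, 0.5)     # variances also sum to 3
closed_form = (-(4 / 2) * np.log(2 * np.pi) - (4 / 2) * np.log(3.0)
               - np.sum(y ** 2) / (2 * 3.0))   # [13.4.5] with T = 4
```

Both parameter pairs give the same value, which also agrees with the closed form [13.4.5].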

The MA(1) process explored in Section 13.3 provides a second example of
an unidentified state-space representation. As the reader may verify in Exercise
13.5, the identical value for the log likelihood function [13.4.2] would result if $\theta$
is replaced by $\theta^{-1}$ and $\sigma^2$ by $\theta^2\sigma^2$.
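This invariance is easy to check with the scalar recursions of Section 13.3. The sketch below is pure Python; the function name `ma1_exact_loglik` and the data are hypothetical, and the code simply combines [13.3.14], [13.3.18], [13.3.19], and [13.3.21] with the Gaussian density [13.4.1].

```python
import math

def ma1_exact_loglik(y, mu, theta, sigma2):
    """Exact Gaussian log likelihood of an MA(1) sample via the Kalman filter."""
    eps_hat = 0.0     # inference about the previous innovation, starts at 0
    p = sigma2        # p_1, the MSE of that inference
    loglik = 0.0
    for y_t in y:
        mse = sigma2 + theta**2 * p            # forecast MSE, [13.3.18]
        err = y_t - mu - theta * eps_hat       # y_t minus the forecast [13.3.14]
        loglik += -0.5 * (math.log(2 * math.pi) + math.log(mse) + err**2 / mse)
        eps_hat = (sigma2 / mse) * err         # update, [13.3.19]
        p = sigma2 * theta**2 * p / mse        # update, [13.3.21]
    return loglik

y = [0.3, -1.2, 0.8, 0.1, -0.4, 1.5]           # made-up sample
ll_invertible = ma1_exact_loglik(y, mu=0.0, theta=0.5, sigma2=1.0)
ll_noninvertible = ma1_exact_loglik(y, mu=0.0, theta=2.0, sigma2=0.25)
```

Here the second call uses $\theta^{-1} = 2$ and $\theta^2\sigma^2 = 0.25$, and the two log likelihoods coincide up to rounding error.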

These two examples illustrate two basic forms in which absence of
identification can occur. Following Rothenberg (1971), a model is said to be globally
identified at a particular parameter value $\theta_0$ if for any value of $\theta$ there exists a
possible realization $\mathcal{Y}_T$ for which the value of the likelihood at $\theta$ is different from
the value of the likelihood at $\theta_0$. A model is said to be locally identified at $\theta_0$ if
there exists a $\delta > 0$ such that for any value of $\theta$ satisfying $(\theta - \theta_0)'(\theta - \theta_0) < \delta$,
there exists a possible realization of $\mathcal{Y}_T$ for which the value of the likelihood at $\theta$
is different from the value of the likelihood at $\theta_0$. Thus, global identification implies
local identification. The first example, [13.4.3] and [13.4.4], is neither globally nor
locally identified, while the MA(1) example is locally identified but globally
unidentified.

Local identification is much easier to test for than global identification.
Rothenberg (1971) showed that a model is locally identified at $\theta_0$ if and only if the
information matrix is nonsingular in a neighborhood around $\theta_0$. Thus, a common
symptom of trying to estimate an unidentified model is difficulty with inverting the 
matrix of second derivatives of the log likelihood function. One approach to check- 
ing for local identification is to translate the state-space representation back into 
a vector ARMA model and check for satisfaction of the conditions in Hannan 
(1971); see Hamilton (1985) for an example of this approach. A second approach 
is to work directly with the state-space representation, as is done in Gevers and 
Wertz (1984) and Wall (1987). For an illustration of the second approach, see 
Burmeister, Wall, and Hamilton (1986). 


Asymptotic Properties of Maximum Likelihood Estimates 


If certain regularity conditions are satisfied, then Caines (1988, Chapter 7)
showed that the maximum likelihood estimate $\hat{\theta}_T$ based on a sample of size T is
consistent and asymptotically normal. These conditions include the following: (1)
the model must be identified; (2) eigenvalues of F are all inside the unit circle; (3)
apart from a constant term, the variables $x_t$ behave asymptotically like a full-rank
linearly indeterministic covariance-stationary process; and (4) the true value of $\theta$
does not fall on a boundary of the allowable parameter space. Pagan (1980,
Theorem 4) and Ghosh (1989) examined special cases of state-space models for which

$$\sqrt{T}\,\mathcal{I}_{2D,T}^{1/2}(\hat{\theta}_T - \theta_0) \xrightarrow{L} N(0, I_a), \qquad [13.4.6]$$

where a is the number of elements of $\theta$ and $\mathcal{I}_{2D,T}$ is the $(a \times a)$ information matrix
for a sample of size T as calculated from second derivatives of the log likelihood
function:
$$\mathcal{I}_{2D,T} = -\frac{1}{T}E\left(\sum_{t=1}^{T}\left.\frac{\partial^2 \log f(y_t \mid x_t, \mathcal{Y}_{t-1}; \theta)}{\partial\theta\,\partial\theta'}\right|_{\theta=\theta_0}\right). \qquad [13.4.7]$$

A common practice is to assume that the limit of $\mathcal{I}_{2D,T}$ as $T \to \infty$ is the same as
the plim of

$$\hat{\mathcal{I}}_{2D,T} = -\frac{1}{T}\sum_{t=1}^{T}\left.\frac{\partial^2 \log f(y_t \mid x_t, \mathcal{Y}_{t-1}; \theta)}{\partial\theta\,\partial\theta'}\right|_{\theta=\hat{\theta}_T}, \qquad [13.4.8]$$
which can be calculated analytically or numerically by differentiating [13.4.2].
Reported standard errors for $\hat{\theta}_T$ are then square roots of diagonal elements of
$(1/T)(\hat{\mathcal{I}}_{2D,T})^{-1}$.


Quasi-Maximum Likelihood Estimation 


Even if the disturbances $v_t$ and $w_t$ are non-Gaussian, the Kalman filter can
still be used to calculate the linear projection of $y_{t+s}$ on past observables. Moreover,
we can form the function [13.4.2] and maximize it with respect to $\theta$ even for
non-Gaussian systems. This procedure will still yield consistent and asymptotically
Normal estimates of the elements of F, Q, A, H, and R, with the variance-covariance
matrix constructed as described in equation [5.8.7]. Watson (1989, Theorem 2)
presented conditions under which the quasi-maximum likelihood estimates satisfy

$$\sqrt{T}(\hat{\theta}_T - \theta_0) \xrightarrow{L} N(0, [\mathcal{I}_{2D}\mathcal{I}_{OP}^{-1}\mathcal{I}_{2D}]^{-1}), \qquad [13.4.9]$$
where $\mathcal{I}_{2D}$ is the plim of [13.4.8] when evaluated at the true value $\theta_0$ and $\mathcal{I}_{OP}$ is
the outer-product estimate of the information matrix,

$$\mathcal{I}_{OP} = \operatorname{plim}\,(1/T)\sum_{t=1}^{T}[h(\theta_0, \mathcal{Y}_t)][h(\theta_0, \mathcal{Y}_t)]',$$

where

$$h(\theta_0, \mathcal{Y}_t) = \left.\frac{\partial \log f(y_t \mid x_t, \mathcal{Y}_{t-1}; \theta)}{\partial\theta}\right|_{\theta=\theta_0}.$$


13.5. The Steady-State Kalman Filter 


Convergence Properties of the Kalman Filter 


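Section 13.3 applied the Kalman filter to an MA(1) process and found that
when $|\theta| < 1$,

$$\lim_{t\to\infty} P_{t+1|t} = \begin{bmatrix} \sigma^2 & 0 \\ 0 & 0 \end{bmatrix}, \qquad \lim_{t\to\infty} K_t = \begin{bmatrix} 0 \\ 1 \end{bmatrix},$$

whereas when $|\theta| > 1$,

$$\lim_{t\to\infty} P_{t+1|t} = \begin{bmatrix} \sigma^2 & 0 \\ 0 & \sigma^2(\theta^2 - 1)/\theta^2 \end{bmatrix}, \qquad \lim_{t\to\infty} K_t = \begin{bmatrix} 0 \\ 1/\theta^2 \end{bmatrix}.$$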


It turns out to be a property of a broad class of state-space models that the sequences
$\{P_{t+1|t}\}_{t=1}^T$ and $\{K_t\}_{t=1}^T$ converge to fixed matrices, as the following proposition
shows.


Proposition 13.1: Let F be an $(r \times r)$ matrix whose eigenvalues are all inside the
unit circle, let H' denote an arbitrary $(n \times r)$ matrix, and let Q and R be positive
semidefinite symmetric $(r \times r)$ and $(n \times n)$ matrices, respectively. Let $\{P_{t+1|t}\}_{t=1}^T$
be the sequence of MSE matrices calculated by the Kalman filter,

$$P_{t+1|t} = F[P_{t|t-1} - P_{t|t-1}H(H'P_{t|t-1}H + R)^{-1}H'P_{t|t-1}]F' + Q, \qquad [13.5.1]$$
where iteration on [13.5.1] is initialized by letting $P_{1|0}$ be the positive semidefinite
$(r \times r)$ matrix satisfying

$$\operatorname{vec}(P_{1|0}) = [I_{r^2} - (F \otimes F)]^{-1} \cdot \operatorname{vec}(Q). \qquad [13.5.2]$$
Then $\{P_{t+1|t}\}_{t=1}^T$ is a monotonically nonincreasing sequence and converges as $T \to \infty$
to a steady-state matrix P satisfying

$$P = F[P - PH(H'PH + R)^{-1}H'P]F' + Q. \qquad [13.5.3]$$
Moreover, the steady-state value for the Kalman gain matrix, defined by
$$K \equiv FPH(H'PH + R)^{-1}, \qquad [13.5.4]$$

has the property that the eigenvalues of $(F - KH')$ all lie on or inside the unit circle.


The claim in Proposition 13.1 that $P_{t+1|t} \le P_{t|t-1}$ means that for any real
$(r \times 1)$ vector h, the scalar inequality $h'P_{t+1|t}h \le h'P_{t|t-1}h$ holds.

Proposition 13.1 assumes that the Kalman filter is started with $P_{1|0}$ equal to
the unconditional variance-covariance matrix of the state vector $\xi_t$. Although the
sequence $\{P_{t+1|t}\}$ converges to a matrix P, the solution to [13.5.3] need not be
unique; a different starting value for $P_{1|0}$ might produce a sequence that converges
to a different matrix P satisfying [13.5.3]. Under the slightly stronger assumption
that either Q or R is strictly positive definite, iteration on [13.5.1] will converge
to a unique solution to [13.5.3], where the starting value for the iteration $P_{1|0}$ can
be any positive semidefinite symmetric matrix.


Proposition 13.2: Let F be an $(r \times r)$ matrix whose eigenvalues are all inside the
unit circle, let H' denote an arbitrary $(n \times r)$ matrix, and let Q and R be positive
semidefinite symmetric $(r \times r)$ and $(n \times n)$ matrices, respectively, with either Q or
R strictly positive definite. Then the sequence of Kalman MSE matrices $\{P_{t+1|t}\}_{t=1}^T$
determined by [13.5.1] converges to a unique positive semidefinite steady-state matrix
P satisfying [13.5.3], where the value of P is the same for any positive semidefinite
symmetric starting value for $P_{1|0}$. Moreover, the steady-state value for the Kalman
gain matrix K in [13.5.4] has the property that the eigenvalues of $(F - KH')$ are all
strictly inside the unit circle.
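The content of Proposition 13.2 can be illustrated by iterating [13.5.1] from two very different starting values. The Python/NumPy sketch below is a hedged example: the function name `steady_state`, the tolerance, and the scalar system (an AR(1) state observed with noise, using made-up parameter values) are assumptions for illustration.

```python
import numpy as np

def steady_state(F, Q, H, R, P_start, tol=1e-12, max_iter=10_000):
    """Iterate [13.5.1] until successive MSE matrices agree to within tol."""
    P = np.asarray(P_start, float)
    for _ in range(max_iter):
        S = H.T @ P @ H + R
        P_next = F @ (P - P @ H @ np.linalg.solve(S, H.T @ P)) @ F.T + Q
        converged = np.max(np.abs(P_next - P)) < tol
        P = P_next
        if converged:
            break
    K = F @ P @ H @ np.linalg.inv(H.T @ P @ H + R)   # steady-state gain, [13.5.4]
    return P, K

# Scalar example with R strictly positive, so Proposition 13.2 applies.
F = np.array([[0.9]])
Q = np.array([[1.0]])
H = np.array([[1.0]])
R = np.array([[0.5]])

P_a, K_a = steady_state(F, Q, H, R, P_start=np.array([[0.0]]))
P_b, K_b = steady_state(F, Q, H, R, P_start=np.array([[100.0]]))
eigs = np.linalg.eigvals(F - K_a @ H.T)
```

Both starting values lead to the same P, and the eigenvalue of $(F - KH')$ is inside the unit circle.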


We next discuss the relevance of the results in Propositions 13.1 and 13.2
concerning the eigenvalues of $(F - KH')$.


Using the Kalman Filter to Find the Wold Representation 
and Factor an Autocovariance-Generating Function 


Consider a system in which the explanatory variables ($x_t$) consist solely of a
constant term. Without loss of generality, we simplify the notation by assuming
that $A'x_t \equiv 0$. For such systems, the Kalman filter forecast of the state vector can
be written as in [13.2.20]:

$$\hat{\xi}_{t+1|t} = F\hat{\xi}_{t|t-1} + K_t(y_t - H'\hat{\xi}_{t|t-1}). \qquad [13.5.5]$$

The linear projection of $y_{t+1}$ on the observed finite sample of its own lagged values
is then calculated from

$$\hat{y}_{t+1|t} = \hat{E}(y_{t+1} \mid y_t, y_{t-1}, \ldots, y_1) = H'\hat{\xi}_{t+1|t}, \qquad [13.5.6]$$
with MSE given by [13.2.25]:
$$E[(y_{t+1} - \hat{y}_{t+1|t})(y_{t+1} - \hat{y}_{t+1|t})'] = H'P_{t+1|t}H + R. \qquad [13.5.7]$$

Consider the result from applying the Kalman filter to a covariance-stationary
process that started up at a time arbitrarily distant in the past. From Proposition
13.1, the difference equation [13.5.5] will converge to

$$\hat{\xi}_{t+1|t} = F\hat{\xi}_{t|t-1} + K(y_t - H'\hat{\xi}_{t|t-1}), \qquad [13.5.8]$$
with K given by [13.5.4]. The forecast [13.5.6] will approach the forecast of $y_{t+1}$
based on the infinite history of its own lagged values:

$$\hat{E}(y_{t+1} \mid y_t, y_{t-1}, \ldots) = H'\hat{\xi}_{t+1|t}. \qquad [13.5.9]$$
The MSE of this forecast is given by the limiting value of [13.5.7],
$$E\{[y_{t+1} - \hat{E}(y_{t+1} \mid y_t, y_{t-1}, \ldots)][y_{t+1} - \hat{E}(y_{t+1} \mid y_t, y_{t-1}, \ldots)]'\} = H'PH + R, \qquad [13.5.10]$$

where P is given by [13.5.3].

Equation [13.5.8] can be written

$$\hat{\xi}_{t+1|t} = (F - KH')L\hat{\xi}_{t+1|t} + Ky_t \qquad [13.5.11]$$
for L the lag operator. Provided that the eigenvalues of $(F - KH')$ are all inside
the unit circle, [13.5.11] can be expressed as

$$\begin{aligned} \hat{\xi}_{t+1|t} &= [I_r - (F - KH')L]^{-1}Ky_t \\ &= [I_r + (F - KH')L + (F - KH')^2L^2 + (F - KH')^3L^3 + \cdots]Ky_t. \end{aligned} \qquad [13.5.12]$$

Substituting [13.5.12] into [13.5.9] gives a steady-state rule for forecasting $y_{t+1}$ as
a linear function of its lagged values:

$$\hat{E}(y_{t+1} \mid y_t, y_{t-1}, \ldots) = H'[I_r - (F - KH')L]^{-1}Ky_t. \qquad [13.5.13]$$
Expression [13.5.13] implies a VAR(∞) representation for $y_t$ of the form
$$y_{t+1} = H'[I_r - (F - KH')L]^{-1}Ky_t + \varepsilon_{t+1}, \qquad [13.5.14]$$
where

$$\varepsilon_{t+1} \equiv y_{t+1} - \hat{E}(y_{t+1} \mid y_t, y_{t-1}, \ldots). \qquad [13.5.15]$$
Thus, $\varepsilon_{t+1}$ is the fundamental innovation for $y_{t+1}$. Since $\varepsilon_{t+1}$ is uncorrelated with
$y_{t-j}$ for any $j \ge 0$, it is also uncorrelated with $\varepsilon_{t-j} = y_{t-j} - \hat{E}(y_{t-j} \mid y_{t-j-1}, y_{t-j-2},
\ldots)$ for any $j \ge 0$. The variance-covariance matrix of $\varepsilon_{t+1}$ can be calculated using
[13.5.15] and [13.5.10]:

$$\begin{aligned} E(\varepsilon_{t+1}\varepsilon_{t+1}') &= E\{[y_{t+1} - \hat{E}(y_{t+1} \mid y_t, y_{t-1}, \ldots)] \times [y_{t+1} - \hat{E}(y_{t+1} \mid y_t, y_{t-1}, \ldots)]'\} \\ &= H'PH + R. \end{aligned} \qquad [13.5.16]$$


Note that [13.5.14] can be written as
$$\{I_n - H'[I_r - (F - KH')L]^{-1}KL\}y_{t+1} = \varepsilon_{t+1}. \qquad [13.5.17]$$
The following result helps to rewrite the VAR(∞) representation [13.5.17] in the
Wold MA(∞) form.


Proposition 13.3: Let F, H', and K be matrices of dimension $(r \times r)$, $(n \times r)$, and
$(r \times n)$, respectively, such that eigenvalues of F and of $(F - KH')$ are all inside
the unit circle, and let z be a scalar on the complex unit circle. Then

$$\{I_n + H'(I_r - Fz)^{-1}Kz\}\{I_n - H'[I_r - (F - KH')z]^{-1}Kz\} = I_n.$$

Applying Proposition 13.3, if both sides of [13.5.17] are premultiplied by
$(I_n + H'(I_r - FL)^{-1}KL)$, the result is the Wold representation for y:
$$y_{t+1} = \{I_n + H'(I_r - FL)^{-1}KL\}\varepsilon_{t+1}. \qquad [13.5.18]$$

To summarize, the Wold representation can be found by iterating on [13.5.1] 
until convergence. The steady-state value for P is then used to construct K in 
[13.5.4]. If the eigenvalues of (F — KH’) are all inside the unit circle, then the 
Wold representation is given by [13.5.18]. 
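The identity in Proposition 13.3 that links the VAR(∞) operator in [13.5.17] to the MA(∞) operator in [13.5.18] can be checked numerically at a point on the unit circle. In the Python/NumPy sketch below, the function names are hypothetical, and K is an arbitrary illustrative matrix satisfying the eigenvalue conditions rather than a steady-state Kalman gain.

```python
import numpy as np

def wold_ma_operator(F, H, K, z):
    """Evaluate I_n + H'(I_r - F z)^{-1} K z, the operator in [13.5.18]."""
    r, n = H.shape
    return np.eye(n) + H.T @ np.linalg.solve(np.eye(r) - F * z, K) * z

def var_operator(F, H, K, z):
    """Evaluate I_n - H'[I_r - (F - K H') z]^{-1} K z, the operator in [13.5.17]."""
    r, n = H.shape
    G = F - K @ H.T
    return np.eye(n) - H.T @ np.linalg.solve(np.eye(r) - G * z, K) * z

# Illustrative matrices: eigenvalues of F are 0.5 and 0.3,
# and eigenvalues of F - K H' are 0.3 and 0.25.
F = np.array([[0.5, 0.1], [0.0, 0.3]])
H = np.array([[1.0], [0.5]])
K = np.array([[0.2], [0.1]])
z = np.exp(0.7j)                     # a point on the complex unit circle

product = wold_ma_operator(F, H, K, z) @ var_operator(F, H, K, z)
```

The product equals the identity matrix up to rounding error, so the MA operator inverts the VAR operator at this value of z.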

The task of finding the Wold representation is sometimes alternatively posed
as the question of factoring the autocovariance-generating function of y. Applying
result [10.3.7] to [13.5.16] and [13.5.18], we would anticipate that the
autocovariance-generating function of y can be written in the form

$$G_Y(z) = \{I_n + H'(I_r - Fz)^{-1}Kz\}\{H'PH + R\} \times \{I_n + K'(I_r - F'z^{-1})^{-1}Hz^{-1}\}. \qquad [13.5.19]$$

Compare [13.5.19] with the autocovariance-generating function that we would have
written down directly from the structure of the state-space model. From [10.3.5],
the autocovariance-generating function of $\xi$ is given by
$$G_\xi(z) = [I_r - Fz]^{-1}Q[I_r - F'z^{-1}]^{-1},$$

while from [10.3.6] the autocovariance-generating function of $y_t = H'\xi_t + w_t$ is

$$G_Y(z) = H'[I_r - Fz]^{-1}Q[I_r - F'z^{-1}]^{-1}H + R. \qquad [13.5.20]$$
Comparing [13.5.19] with [13.5.20] suggests that the limiting values for the Kalman
gain and MSE matrices K and P can be used to factor an autocovariance-generating
function. The following proposition gives a formal statement of this result.


Proposition 13.4: Let $F$ denote an $(r \times r)$ matrix whose eigenvalues are all inside the
unit circle; let $Q$ and $R$ denote symmetric positive semidefinite matrices of dimension
$(r \times r)$ and $(n \times n)$, respectively; and let $H'$ denote an arbitrary $(n \times r)$ matrix. Let
$P$ be a positive semidefinite matrix satisfying [13.5.3] and let $K$ be given by [13.5.4].
Suppose that eigenvalues of $(F - KH')$ are all inside the unit circle. Then

$$H'[I_r - Fz]^{-1} Q [I_r - F'z^{-1}]^{-1} H + R = \{I_n + H'(I_r - Fz)^{-1}Kz\}\{H'PH + R\}\{I_n + K'(I_r - F'z^{-1})^{-1}Hz^{-1}\}.$$ [13.5.21]

A direct demonstration of this claim is provided in Appendix 13.A at the end 
of this chapter. 

As an example of using these results, consider observations on a univariate
AR(1) process subject to white noise measurement error, such as the state-space
system of [13.1.26] and [13.1.27] with $\mu = 0$. For this system, $F = \phi$, $Q = \sigma_V^2$,
$A = 0$, $H = 1$, and $R = \sigma_W^2$. The conditions of Proposition 13.2 are satisfied as
long as $|\phi| < 1$, establishing that $|F - KH| = |\phi - K| < 1$. From equation




[13.5.14], the AR($\infty$) representation for this process can be found from

$$y_{t+1} = [1 - (\phi - K)L]^{-1} K y_t + \varepsilon_{t+1},$$

which can be written

$$[1 - (\phi - K)L] y_{t+1} = K y_t + [1 - (\phi - K)L]\varepsilon_{t+1}$$

or

$$y_{t+1} = \phi y_t + \varepsilon_{t+1} - (\phi - K)\varepsilon_t.$$ [13.5.22]

This is an ARMA(1, 1) process with AR parameter given by $\phi$ and MA parameter
given by $-(\phi - K)$. The variance of the innovation for this process can be
calculated from [13.5.16]:

$$E(\varepsilon_{t+1}^2) = \sigma_W^2 + P.$$ [13.5.23]

The value of $P$ can be found by iterating on [13.5.1]:

$$P_{t+1|t} = \phi^2 [P_{t|t-1} - P_{t|t-1}^2/(\sigma_W^2 + P_{t|t-1})] + \sigma_V^2 = \phi^2 P_{t|t-1}\sigma_W^2/(\sigma_W^2 + P_{t|t-1}) + \sigma_V^2,$$ [13.5.24]

starting from $P_{1|0} = \sigma_V^2/(1 - \phi^2)$, until convergence. The steady-state Kalman
gain is given by [13.5.4]:

$$K = \phi P/(\sigma_W^2 + P).$$ [13.5.25]
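
The scalar recursion [13.5.24] is easy to iterate directly. The following sketch is illustrative only: the function name `ar1_plus_noise_to_arma` and the parameter values are hypothetical. It returns the steady-state $P$ and $K$ together with the implied MA parameter and innovation variance of the ARMA(1, 1) form [13.5.22].

```python
def ar1_plus_noise_to_arma(phi, sig_v2, sig_w2, tol=1e-14, max_iter=10_000):
    """Iterate [13.5.24] from P_{1|0} = sig_v2 / (1 - phi^2) until convergence."""
    P = sig_v2 / (1.0 - phi**2)
    for _ in range(max_iter):
        P_next = phi**2 * P * sig_w2 / (sig_w2 + P) + sig_v2
        converged = abs(P_next - P) < tol
        P = P_next
        if converged:
            break
    K = phi * P / (sig_w2 + P)      # steady-state gain, [13.5.25]
    theta = -(phi - K)              # MA parameter in [13.5.22]
    innov_var = sig_w2 + P          # innovation variance, [13.5.23]
    return P, K, theta, innov_var

# Hypothetical parameter values.
phi, sig_v2, sig_w2 = 0.8, 1.0, 1.0
P, K, theta, innov_var = ar1_plus_noise_to_arma(phi, sig_v2, sig_w2)
```

A useful sanity check is that the variance and first autocovariance of the resulting ARMA(1, 1) process coincide with those of the AR(1)-plus-noise process, namely $\sigma_V^2/(1 - \phi^2) + \sigma_W^2$ and $\phi\sigma_V^2/(1 - \phi^2)$.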
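As a second example, consider adding an MA($q_1$) process to an MA($q_2$)
process with which the first process is uncorrelated at all leads and lags. This could
be represented in state-space form as follows:

State Equation ($r = q_1 + q_2 + 2$):

$$\begin{bmatrix} u_{t+1} \\ u_t \\ \vdots \\ u_{t-q_1+1} \\ v_{t+1} \\ v_t \\ \vdots \\ v_{t-q_2+1} \end{bmatrix} = \begin{bmatrix} 0' & 0 & 0' & 0 \\ I_{q_1} & 0 & 0 & 0 \\ 0' & 0 & 0' & 0 \\ 0 & 0 & I_{q_2} & 0 \end{bmatrix} \begin{bmatrix} u_t \\ u_{t-1} \\ \vdots \\ u_{t-q_1} \\ v_t \\ v_{t-1} \\ \vdots \\ v_{t-q_2} \end{bmatrix} + \begin{bmatrix} u_{t+1} \\ 0 \\ v_{t+1} \\ 0 \end{bmatrix},$$ [13.5.26]

where the matrix $F$ has dimension $(q_1 + q_2 + 2) \times (q_1 + q_2 + 2)$.

Observation Equation ($n = 1$):

$$y_t = \begin{bmatrix} 1 & \delta_1 & \delta_2 & \cdots & \delta_{q_1} & 1 & \kappa_1 & \kappa_2 & \cdots & \kappa_{q_2} \end{bmatrix} \begin{bmatrix} u_t \\ u_{t-1} \\ \vdots \\ u_{t-q_1} \\ v_t \\ v_{t-1} \\ \vdots \\ v_{t-q_2} \end{bmatrix}.$$ [13.5.27]

Note that all eigenvalues of $F$ are equal to zero. Write equation [13.5.18] in the
form

$$y_{t+1} = \{1 + H'(I_r - FL)^{-1}KL\}\varepsilon_{t+1} = \{1 + H'(I_r + FL + F^2L^2 + F^3L^3 + \cdots)KL\}\varepsilon_{t+1}.$$ [13.5.28]

Let $q \equiv \max\{q_1, q_2\}$, and notice from the structure of $F$ that $F^{q+j} = 0$ for $j = 1,
2, \ldots$. Furthermore, from [13.5.4], $F^qK = F^{q+1}PH(H'PH + R)^{-1} = 0$. Thus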




[13.5.28] takes the form

$$y_{t+1} = \{1 + H'(I_r + FL + F^2L^2 + F^3L^3 + \cdots + F^{q-1}L^{q-1})KL\}\varepsilon_{t+1} = \{1 + \theta_1 L + \theta_2 L^2 + \cdots + \theta_q L^q\}\varepsilon_{t+1},$$ [13.5.29]

where

$$\theta_j \equiv H'F^{j-1}K \quad \text{for } j = 1, 2, \ldots, q.$$

This provides a constructive demonstration of the claim that an MA($q_1$) process
plus an MA($q_2$) process with which it is uncorrelated can be described as an
MA($\max\{q_1, q_2\}$) process.
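
The construction can be tried numerically. The sketch below is an illustration under stated assumptions: the function name `ma_sum_wold` is hypothetical, the MA coefficients of the two component processes are passed as `delta` and `kappa`, the recursion [13.5.1] is assumed to be the usual Riccati recursion with $R = 0$, and it is started from the unconditional variance of the state vector.

```python
import numpy as np

def ma_sum_wold(delta, kappa, var_u, var_v, n_iter=500):
    """Wold MA coefficients and innovation variance for MA(q1) + MA(q2)."""
    q1, q2 = len(delta), len(kappa)
    r = q1 + q2 + 2
    F = np.zeros((r, r))
    F[1:q1 + 1, 0:q1] = np.eye(q1)                   # shift the u block down one slot
    F[q1 + 2:r, q1 + 1:q1 + 1 + q2] = np.eye(q2)     # shift the v block down one slot
    Q = np.zeros((r, r))
    Q[0, 0], Q[q1 + 1, q1 + 1] = var_u, var_v
    H = np.concatenate(([1.0], delta, [1.0], kappa)).reshape(r, 1)
    # Unconditional variance of the state vector (u's and v's are white noise).
    P = np.diag(np.concatenate((np.full(q1 + 1, var_u), np.full(q2 + 1, var_v))))
    for _ in range(n_iter):
        S = (H.T @ P @ H).item()                     # R = 0: no measurement error
        P = F @ (P - P @ H @ H.T @ P / S) @ F.T + Q
    S = (H.T @ P @ H).item()
    K = F @ P @ H / S
    q = max(q1, q2)
    theta = np.array([(H.T @ np.linalg.matrix_power(F, j - 1) @ K).item()
                      for j in range(1, q + 1)])
    return theta, S

# Hypothetical example: an MA(1) with coefficient 0.5 plus an MA(2) with (0.3, 0.2).
theta, innov_var = ma_sum_wold([0.5], [0.3, 0.2], var_u=1.0, var_v=2.0)
```

The returned `theta` has length $\max\{q_1, q_2\}$, and the autocovariances implied by `theta` and `innov_var` can be compared with the sum of the autocovariances of the two component processes.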

The Kalman filter thus provides a general algorithm for finding the Wold
representation or factoring an autocovariance-generating function: we simply
iterate on [13.5.1] until convergence, and then use the steady-state gain from [13.5.4]
either in [13.5.14] (for the AR($\infty$) form) or in [13.5.18] (for the MA($\infty$) form).

Although the convergent values provide the Wold representation, for any
finite $t$ the Kalman filter forecasts have the advantage of calculating the exact
optimal forecast of $y_{t+1}$ based on a linear function of $\{y_t, y_{t-1}, \ldots, y_1\}$.


13.6. Smoothing 


The Kalman filter was motivated in Section 13.2 as an algorithm for calculating a
forecast of the state vector $\xi_t$ as a linear function of previous observations,

$$\hat{\xi}_{t|t-1} \equiv \hat{E}(\xi_t \mid \mathcal{Y}_{t-1}),$$ [13.6.1]

where $\mathcal{Y}_{t-1} \equiv (y'_{t-1}, y'_{t-2}, \ldots, y'_1, x'_{t-1}, x'_{t-2}, \ldots, x'_1)'$. The matrix $P_{t|t-1}$
represented the MSE of this forecast:

$$P_{t|t-1} \equiv E[(\xi_t - \hat{\xi}_{t|t-1})(\xi_t - \hat{\xi}_{t|t-1})'].$$ [13.6.2]


For many uses of the Kalman filter these are the natural magnitudes of interest. 
In some settings, however, the state vector $\xi_t$ is given a structural interpretation,
in which case the value of this unobserved variable might be of interest for its own
sake. For example, in the model of the business cycle by Stock and Watson, it
would be helpful to know the state of the business cycle at any historical date $t$.
A goal might then be to form an inference about the value of $\xi_t$ based on the full
set of data collected, including observations on $y_t, y_{t+1}, \ldots, y_T, x_t, x_{t+1}, \ldots,
x_T$. Such an inference is called the smoothed estimate of $\xi_t$, denoted

$$\hat{\xi}_{t|T} \equiv \hat{E}(\xi_t \mid \mathcal{Y}_T).$$ [13.6.3]


For example, data on GNP from 1954 through 1990 might be used to estimate the
value that $\xi$ took on in 1960. The MSE of this smoothed estimate is denoted

$$P_{t|T} \equiv E[(\xi_t - \hat{\xi}_{t|T})(\xi_t - \hat{\xi}_{t|T})'].$$ [13.6.4]

In general, $P_{\tau|t}$ denotes the MSE of an estimate of $\xi_\tau$ that is based on observations
of $y$ and $x$ through date $t$.

For the reader's convenience, we reproduce here the key equations for the
Kalman filter:

$$\hat{\xi}_{t|t} = \hat{\xi}_{t|t-1} + P_{t|t-1}H(H'P_{t|t-1}H + R)^{-1}(y_t - A'x_t - H'\hat{\xi}_{t|t-1})$$ [13.6.5]

$$\hat{\xi}_{t+1|t} = F\hat{\xi}_{t|t}$$ [13.6.6]

$$P_{t|t} = P_{t|t-1} - P_{t|t-1}H(H'P_{t|t-1}H + R)^{-1}H'P_{t|t-1}$$ [13.6.7]

$$P_{t+1|t} = FP_{t|t}F' + Q.$$ [13.6.8]




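Consider the estimate of $\xi_t$ based on observations through date $t$, $\hat{\xi}_{t|t}$. Suppose
we were subsequently told the true value of $\xi_{t+1}$. From the formula for updating
a linear projection, equation [4.5.30], the new estimate of $\xi_t$ could be expressed
as*

$$\hat{E}(\xi_t \mid \xi_{t+1}, \mathcal{Y}_t) = \hat{\xi}_{t|t} + \{E[(\xi_t - \hat{\xi}_{t|t})(\xi_{t+1} - \hat{\xi}_{t+1|t})']\} \times \{E[(\xi_{t+1} - \hat{\xi}_{t+1|t})(\xi_{t+1} - \hat{\xi}_{t+1|t})']\}^{-1} \times (\xi_{t+1} - \hat{\xi}_{t+1|t}).$$ [13.6.9]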
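The first term in the product on the right side of [13.6.9] can be written

$$E[(\xi_t - \hat{\xi}_{t|t})(\xi_{t+1} - \hat{\xi}_{t+1|t})'] = E[(\xi_t - \hat{\xi}_{t|t})(F\xi_t + v_{t+1} - F\hat{\xi}_{t|t})'],$$

by virtue of [13.2.1] and [13.6.6]. Furthermore, $v_{t+1}$ is uncorrelated with $\xi_t$ and
$\hat{\xi}_{t|t}$. Thus,

$$E[(\xi_t - \hat{\xi}_{t|t})(\xi_{t+1} - \hat{\xi}_{t+1|t})'] = E[(\xi_t - \hat{\xi}_{t|t})(\xi_t - \hat{\xi}_{t|t})'F'] = P_{t|t}F'.$$ [13.6.10]

Substituting [13.6.10] and the definition of $P_{t+1|t}$ into [13.6.9] produces

$$\hat{E}(\xi_t \mid \xi_{t+1}, \mathcal{Y}_t) = \hat{\xi}_{t|t} + P_{t|t}F'P_{t+1|t}^{-1}(\xi_{t+1} - \hat{\xi}_{t+1|t}).$$

Defining

$$J_t \equiv P_{t|t}F'P_{t+1|t}^{-1},$$ [13.6.11]

we have

$$\hat{E}(\xi_t \mid \xi_{t+1}, \mathcal{Y}_t) = \hat{\xi}_{t|t} + J_t(\xi_{t+1} - \hat{\xi}_{t+1|t}).$$ [13.6.12]

Now, the linear projection in [13.6.12] turns out to be the same as

$$\hat{E}(\xi_t \mid \xi_{t+1}, \mathcal{Y}_T);$$ [13.6.13]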


that is, knowledge of $y_{t+j}$ or $x_{t+j}$ for $j > 0$ would be of no added value if we already
knew the value of $\xi_{t+1}$. To see this, note that $y_{t+j}$ can be written as

$$y_{t+j} = A'x_{t+j} + H'(F^{j-1}\xi_{t+1} + F^{j-2}v_{t+2} + F^{j-3}v_{t+3} + \cdots + v_{t+j}) + w_{t+j}.$$

But the error

$$\xi_t - \hat{E}(\xi_t \mid \xi_{t+1}, \mathcal{Y}_t)$$ [13.6.14]

is uncorrelated with $\xi_{t+1}$, by the definition of a linear projection, and uncorrelated
with $x_{t+j}, w_{t+j}, v_{t+j}, v_{t+j-1}, \ldots, v_{t+2}$ under the maintained assumptions. Thus,
the error [13.6.14] is uncorrelated with $y_{t+j}$ or $x_{t+j}$ for $j > 0$, meaning that [13.6.13]
and [13.6.12] are the same, as claimed:

$$\hat{E}(\xi_t \mid \xi_{t+1}, \mathcal{Y}_T) = \hat{\xi}_{t|t} + J_t(\xi_{t+1} - \hat{\xi}_{t+1|t}).$$ [13.6.15]


*Here, $Y_3 = \xi_t$, $Y_2 = \xi_{t+1}$, and $Y_1 = \mathcal{Y}_t$.

It follows from the law of iterated projections that the smoothed estimate,
$\hat{E}(\xi_t \mid \mathcal{Y}_T)$, can be obtained by projecting [13.6.15] on $\mathcal{Y}_T$. In calculating this
projection, we need to think carefully about the nature of the magnitudes in [13.6.15].
The first term, $\hat{\xi}_{t|t}$, indicates a particular exact linear function of $\mathcal{Y}_t$; the coefficients
of this function are constructed from population moments, and these coefficients
should be viewed as deterministic constants from the point of view of performing
a subsequent projection. The projection of $\hat{\xi}_{t|t}$ on $\mathcal{Y}_T$ is thus still $\hat{\xi}_{t|t}$, this same
linear function of $\mathcal{Y}_t$; we can't improve on a perfect fit!* The term $J_t$ in [13.6.11]
is also a function of population moments, and so is again treated as deterministic
for purposes of any linear projection. The term $\hat{\xi}_{t+1|t}$ is another exact linear function
of $\mathcal{Y}_t$. Thus, projecting [13.6.15] on $\mathcal{Y}_T$ turns out to be trivial:

$$\hat{E}(\xi_t \mid \mathcal{Y}_T) = \hat{\xi}_{t|t} + J_t[\hat{E}(\xi_{t+1} \mid \mathcal{Y}_T) - \hat{\xi}_{t+1|t}],$$

or

$$\hat{\xi}_{t|T} = \hat{\xi}_{t|t} + J_t(\hat{\xi}_{t+1|T} - \hat{\xi}_{t+1|t}).$$ [13.6.16]


Thus, the sequence of smoothed estimates $\{\hat{\xi}_{t|T}\}_{t=1}^{T}$ is calculated as follows.
First, the Kalman filter, [13.6.5] to [13.6.8], is calculated and the sequences
$\{\hat{\xi}_{t|t}\}_{t=1}^{T}$, $\{\hat{\xi}_{t+1|t}\}_{t=0}^{T-1}$, $\{P_{t|t}\}_{t=1}^{T}$, and $\{P_{t+1|t}\}_{t=0}^{T-1}$ are stored. The smoothed estimate
for the final date in the sample, $\hat{\xi}_{T|T}$, is just the last entry in $\{\hat{\xi}_{t|t}\}_{t=1}^{T}$. Next, [13.6.11]
is used to generate $\{J_t\}_{t=1}^{T-1}$. From this, [13.6.16] is used for $t = T - 1$ to calculate

$$\hat{\xi}_{T-1|T} = \hat{\xi}_{T-1|T-1} + J_{T-1}(\hat{\xi}_{T|T} - \hat{\xi}_{T|T-1}).$$

Now that $\hat{\xi}_{T-1|T}$ has been calculated, [13.6.16] can be used for $t = T - 2$ to
evaluate

$$\hat{\xi}_{T-2|T} = \hat{\xi}_{T-2|T-2} + J_{T-2}(\hat{\xi}_{T-1|T} - \hat{\xi}_{T-1|T-2}).$$

Proceeding backward through the sample in this fashion permits calculation of the
full set of smoothed estimates, $\{\hat{\xi}_{t|T}\}_{t=1}^{T}$.

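Next, consider the mean squared error associated with the smoothed estimate.
Subtracting both sides of [13.6.16] from $\xi_t$ produces

$$\xi_t - \hat{\xi}_{t|T} = \xi_t - \hat{\xi}_{t|t} - J_t\hat{\xi}_{t+1|T} + J_t\hat{\xi}_{t+1|t}$$

or

$$\xi_t - \hat{\xi}_{t|T} + J_t\hat{\xi}_{t+1|T} = \xi_t - \hat{\xi}_{t|t} + J_t\hat{\xi}_{t+1|t}.$$

Multiplying this equation by its transpose and taking expectations,

$$E[(\xi_t - \hat{\xi}_{t|T})(\xi_t - \hat{\xi}_{t|T})'] + J_t E[\hat{\xi}_{t+1|T}\hat{\xi}'_{t+1|T}]J_t' = E[(\xi_t - \hat{\xi}_{t|t})(\xi_t - \hat{\xi}_{t|t})'] + J_t E[\hat{\xi}_{t+1|t}\hat{\xi}'_{t+1|t}]J_t'.$$ [13.6.17]

The cross-product terms have disappeared from the left side because $\hat{\xi}_{t+1|T}$ is a
linear function of $\mathcal{Y}_T$ and so is uncorrelated with the projection error $\xi_t - \hat{\xi}_{t|T}$.
Similarly, on the right side, $\hat{\xi}_{t+1|t}$ is uncorrelated with $\xi_t - \hat{\xi}_{t|t}$.

Equation [13.6.17] states that

$$P_{t|T} = P_{t|t} + J_t\{-E[\hat{\xi}_{t+1|T}\hat{\xi}'_{t+1|T}] + E[\hat{\xi}_{t+1|t}\hat{\xi}'_{t+1|t}]\}J_t'.$$ [13.6.18]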


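*The law of iterated projections states that

$$\hat{E}(\xi_t \mid \mathcal{Y}_t) = \hat{E}\{\hat{E}(\xi_t \mid \mathcal{Y}_T) \mid \mathcal{Y}_t\}.$$

The law of iterated projections thus allows us to go from a larger information set to a smaller. Of
course, the same operation does not work in reverse:

$$\hat{E}(\xi_t \mid \mathcal{Y}_T) \neq \hat{E}[\hat{E}(\xi_t \mid \mathcal{Y}_t) \mid \mathcal{Y}_T].$$

We cannot go from a smaller information set to a larger.
An example may clarify this point. Let $y_t$ be an i.i.d. zero-mean sequence with

$$\xi_t = \mu + y_{t+1}.$$

Then

$$\hat{E}(\xi_t \mid y_t) = \mu$$

and

$$\hat{E}[\hat{E}(\xi_t \mid y_t) \mid y_t, y_{t+1}] = \hat{E}[\mu \mid y_t, y_{t+1}] = \mu.$$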




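The bracketed term in [13.6.18] can be expressed as

$$-E[\hat{\xi}_{t+1|T}\hat{\xi}'_{t+1|T}] + E[\hat{\xi}_{t+1|t}\hat{\xi}'_{t+1|t}]$$
$$= \{E[\xi_{t+1}\xi'_{t+1}] - E[\hat{\xi}_{t+1|T}\hat{\xi}'_{t+1|T}]\} - \{E[\xi_{t+1}\xi'_{t+1}] - E[\hat{\xi}_{t+1|t}\hat{\xi}'_{t+1|t}]\}$$
$$= \{E[(\xi_{t+1} - \hat{\xi}_{t+1|T})(\xi_{t+1} - \hat{\xi}_{t+1|T})']\} - \{E[(\xi_{t+1} - \hat{\xi}_{t+1|t})(\xi_{t+1} - \hat{\xi}_{t+1|t})']\}$$
$$= P_{t+1|T} - P_{t+1|t}.$$ [13.6.19]

The second-to-last equality used the fact that

$$E[\xi_{t+1}\hat{\xi}'_{t+1|T}] = E[(\xi_{t+1} - \hat{\xi}_{t+1|T} + \hat{\xi}_{t+1|T})\hat{\xi}'_{t+1|T}]$$
$$= E[(\xi_{t+1} - \hat{\xi}_{t+1|T})\hat{\xi}'_{t+1|T}] + E[\hat{\xi}_{t+1|T}\hat{\xi}'_{t+1|T}]$$
$$= E[\hat{\xi}_{t+1|T}\hat{\xi}'_{t+1|T}],$$

since the projection error $(\xi_{t+1} - \hat{\xi}_{t+1|T})$ is uncorrelated with $\hat{\xi}_{t+1|T}$. Similarly,
$E(\xi_{t+1}\hat{\xi}'_{t+1|t}) = E(\hat{\xi}_{t+1|t}\hat{\xi}'_{t+1|t})$. Substituting [13.6.19] into [13.6.18] establishes that
the smoothed estimate $\hat{\xi}_{t|T}$ has MSE given by

$$P_{t|T} = P_{t|t} + J_t(P_{t+1|T} - P_{t+1|t})J_t'.$$ [13.6.20]

Again, this sequence is generated by moving through the sample backward starting
with $t = T - 1$.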
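
The forward filter and backward smoothing pass described above can be sketched as follows. This is a minimal illustration rather than a reference implementation: it assumes $A = 0$ (no exogenous variables), the function names are hypothetical, and arrays are indexed from zero, so `xi_pred[t + 1]` holds the forecast of the state at array position `t + 1` given data through position `t`.

```python
import numpy as np

def kalman_filter(y, F, Q, H, R, xi_init, P_init):
    """Run [13.6.5]-[13.6.8] with A = 0; y has shape (T, n)."""
    T, r = y.shape[0], F.shape[0]
    xi_pred, P_pred = np.zeros((T + 1, r)), np.zeros((T + 1, r, r))
    xi_filt, P_filt = np.zeros((T, r)), np.zeros((T, r, r))
    xi_pred[0], P_pred[0] = xi_init, P_init
    for t in range(T):
        S = H.T @ P_pred[t] @ H + R                    # MSE of the forecast of y_t
        G = P_pred[t] @ H @ np.linalg.inv(S)
        xi_filt[t] = xi_pred[t] + G @ (y[t] - H.T @ xi_pred[t])   # [13.6.5]
        P_filt[t] = P_pred[t] - G @ H.T @ P_pred[t]               # [13.6.7]
        xi_pred[t + 1] = F @ xi_filt[t]                           # [13.6.6]
        P_pred[t + 1] = F @ P_filt[t] @ F.T + Q                   # [13.6.8]
    return xi_filt, P_filt, xi_pred, P_pred

def kalman_smoother(xi_filt, P_filt, xi_pred, P_pred, F):
    """Backward pass: [13.6.11], [13.6.16], and [13.6.20]."""
    T = xi_filt.shape[0]
    xi_sm, P_sm = xi_filt.copy(), P_filt.copy()        # last entries are already smoothed
    for t in range(T - 2, -1, -1):
        J = P_filt[t] @ F.T @ np.linalg.inv(P_pred[t + 1])
        xi_sm[t] = xi_filt[t] + J @ (xi_sm[t + 1] - xi_pred[t + 1])
        P_sm[t] = P_filt[t] + J @ (P_sm[t + 1] - P_pred[t + 1]) @ J.T
    return xi_sm, P_sm
```

A typical call would be `kalman_smoother(*kalman_filter(y, F, Q, H, R, xi_init, P_init), F)`. The backward pass needs only the stored filter output, not the data themselves.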


13.7. Statistical Inference with the Kalman Filter 


The calculation of the mean squared error

$$P_{\tau|T} = E[(\xi_\tau - \hat{\xi}_{\tau|T})(\xi_\tau - \hat{\xi}_{\tau|T})']$$

described earlier assumed that the parameters of the matrices $F$, $Q$, $A$, $H$, and $R$
were known with certainty. Section 13.4 showed how these parameters could be
estimated from the data by maximum likelihood. There would then be some
sampling uncertainty about the true values of these parameters, and the calculation of
$P_{\tau|T}$ would need to be modified to obtain the true mean squared errors of the
smoothed estimates and forecasts.*

Suppose the unknown parameters are collected in a vector $\theta$. For any given
value of $\theta$, the matrices $F(\theta)$, $Q(\theta)$, $A(\theta)$, $H(\theta)$, and $R(\theta)$ could be used to construct
$\hat{\xi}_{\tau|T}(\theta)$ and $P_{\tau|T}(\theta)$ in the formulas presented earlier; for $\tau \leq T$, these are the
smoothed estimate and MSE given in [13.6.16] and [13.6.20], respectively; while
for $\tau > T$, these are the forecast and its MSE in [13.3.25] and [13.3.27]. Let
$\mathcal{Y}_T \equiv (y'_T, y'_{T-1}, \ldots, y'_1, x'_T, x'_{T-1}, \ldots, x'_1)'$ denote the observed data, and let
$\theta_0$ denote the true value of $\theta$. The earlier derivations assumed that the true value
of $\theta$ was used to construct $\hat{\xi}_{\tau|T}(\theta_0)$ and $P_{\tau|T}(\theta_0)$.

Recall that the formulas for updating a linear projection and its MSE, [4.5.30]
and [4.5.31], yield the conditional mean and conditional MSE when applied to
Gaussian vectors; see equation [4.6.7]. Thus, if $\{v_t\}$, $\{w_t\}$, and $\xi_1$ are truly Gaussian,
then the linear projection $\hat{\xi}_{\tau|T}(\theta_0)$ has the interpretation as the expectation of $\xi_\tau$
conditional on the data,

$$\hat{\xi}_{\tau|T}(\theta_0) = E(\xi_\tau \mid \mathcal{Y}_T);$$ [13.7.1]

while $P_{\tau|T}(\theta_0)$ can be described as the conditional MSE:

$$P_{\tau|T}(\theta_0) = E\{[\xi_\tau - \hat{\xi}_{\tau|T}(\theta_0)][\xi_\tau - \hat{\xi}_{\tau|T}(\theta_0)]' \mid \mathcal{Y}_T\}.$$ [13.7.2]


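*This discussion is based on Hamilton (1986).

Let $\hat{\theta}$ denote an estimate of $\theta$ based on $\mathcal{Y}_T$, and let $\hat{\xi}_{\tau|T}(\hat{\theta})$ denote the estimate
that results from using $\hat{\theta}$ to construct the smoothed inference or forecast in [13.6.16]
or [13.3.25]. The conditional mean squared error of this estimate is

$$E\{[\xi_\tau - \hat{\xi}_{\tau|T}(\hat{\theta})][\xi_\tau - \hat{\xi}_{\tau|T}(\hat{\theta})]' \mid \mathcal{Y}_T\}$$
$$= E\{[\xi_\tau - \hat{\xi}_{\tau|T}(\theta_0) + \hat{\xi}_{\tau|T}(\theta_0) - \hat{\xi}_{\tau|T}(\hat{\theta})] \times [\xi_\tau - \hat{\xi}_{\tau|T}(\theta_0) + \hat{\xi}_{\tau|T}(\theta_0) - \hat{\xi}_{\tau|T}(\hat{\theta})]' \mid \mathcal{Y}_T\}$$
$$= E\{[\xi_\tau - \hat{\xi}_{\tau|T}(\theta_0)][\xi_\tau - \hat{\xi}_{\tau|T}(\theta_0)]' \mid \mathcal{Y}_T\} + E\{[\hat{\xi}_{\tau|T}(\theta_0) - \hat{\xi}_{\tau|T}(\hat{\theta})][\hat{\xi}_{\tau|T}(\theta_0) - \hat{\xi}_{\tau|T}(\hat{\theta})]' \mid \mathcal{Y}_T\}.$$ [13.7.3]

Cross-product terms have disappeared from [13.7.3], since

$$E\{[\hat{\xi}_{\tau|T}(\theta_0) - \hat{\xi}_{\tau|T}(\hat{\theta})][\xi_\tau - \hat{\xi}_{\tau|T}(\theta_0)]' \mid \mathcal{Y}_T\}$$
$$= [\hat{\xi}_{\tau|T}(\theta_0) - \hat{\xi}_{\tau|T}(\hat{\theta})] \times E\{[\xi_\tau - \hat{\xi}_{\tau|T}(\theta_0)]' \mid \mathcal{Y}_T\}$$
$$= [\hat{\xi}_{\tau|T}(\theta_0) - \hat{\xi}_{\tau|T}(\hat{\theta})] \times 0'.$$

The first equality follows because $\hat{\xi}_{\tau|T}(\theta_0)$ and $\hat{\xi}_{\tau|T}(\hat{\theta})$ are known nonstochastic
functions of $\mathcal{Y}_T$, and the second equality is implied by [13.7.1]. Substituting [13.7.2]
into [13.7.3] results in

$$E\{[\xi_\tau - \hat{\xi}_{\tau|T}(\hat{\theta})][\xi_\tau - \hat{\xi}_{\tau|T}(\hat{\theta})]' \mid \mathcal{Y}_T\} = P_{\tau|T}(\theta_0) + E\{[\hat{\xi}_{\tau|T}(\theta_0) - \hat{\xi}_{\tau|T}(\hat{\theta})][\hat{\xi}_{\tau|T}(\theta_0) - \hat{\xi}_{\tau|T}(\hat{\theta})]' \mid \mathcal{Y}_T\}.$$ [13.7.4]

Equation [13.7.4] decomposes the mean squared error into two components.
The first component, $P_{\tau|T}(\theta_0)$, might be described as the "filter uncertainty." This
is the term calculated from the smoothing iteration [13.6.20] or forecast MSE
[13.3.27] and represents uncertainty about $\xi_\tau$ that would be present even if the
true value $\theta_0$ were known with certainty. The second term in [13.7.4],

$$E\{[\hat{\xi}_{\tau|T}(\theta_0) - \hat{\xi}_{\tau|T}(\hat{\theta})][\hat{\xi}_{\tau|T}(\theta_0) - \hat{\xi}_{\tau|T}(\hat{\theta})]'\},$$

might be called "parameter uncertainty." It reflects the fact that in a typical sample,
$\hat{\theta}$ will differ from the true value $\theta_0$.

A simple way to estimate the size of each source of uncertainty is by Monte
Carlo integration. Suppose we adopt the Bayesian perspective that $\theta$ itself is a
random variable. From this perspective, [13.7.4] describes the MSE conditional on
$\theta = \theta_0$. Suppose that the posterior distribution of $\theta$ conditional on the data $\mathcal{Y}_T$ is
known; the asymptotic distribution for the MLE in [13.4.6] suggests that $\theta \mid \mathcal{Y}_T$
might be regarded as approximately distributed $N(\hat{\theta}, (1/T) \cdot \mathcal{J}^{-1})$, where $\hat{\theta}$ denotes
the MLE. We might then generate a large number of values of $\theta$, say,
$\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(2000)}$, drawn from a $N(\hat{\theta}, (1/T) \cdot \mathcal{J}^{-1})$ distribution. For each draw
$(j)$, we could calculate the smoothed estimate or forecast $\hat{\xi}_{\tau|T}(\theta^{(j)})$. The
deviations of these estimates across Monte Carlo draws from the estimate $\hat{\xi}_{\tau|T}(\hat{\theta})$
can be used to describe how sensitive the estimate $\hat{\xi}_{\tau|T}(\hat{\theta})$ is to parameter
uncertainty about $\theta$:

$$\frac{1}{2000}\sum_{j=1}^{2000} [\hat{\xi}_{\tau|T}(\theta^{(j)}) - \hat{\xi}_{\tau|T}(\hat{\theta})][\hat{\xi}_{\tau|T}(\theta^{(j)}) - \hat{\xi}_{\tau|T}(\hat{\theta})]'.$$ [13.7.5]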


This affords an estimate of

$$E\{[\hat{\xi}_{\tau|T}(\theta) - \hat{\xi}_{\tau|T}(\hat{\theta})][\hat{\xi}_{\tau|T}(\theta) - \hat{\xi}_{\tau|T}(\hat{\theta})]' \mid \mathcal{Y}_T\},$$

where this expectation is understood to be with respect to the distribution of $\theta$
conditional on $\mathcal{Y}_T$.

For each Monte Carlo realization $\theta^{(j)}$, we can also calculate $P_{\tau|T}(\theta^{(j)})$ from
[13.6.20] or [13.3.27]. Its average value across Monte Carlo draws,

$$\frac{1}{2000}\sum_{j=1}^{2000} P_{\tau|T}(\theta^{(j)}),$$ [13.7.6]

provides an estimate of the filter uncertainty in [13.7.4],

$$E[P_{\tau|T}(\theta) \mid \mathcal{Y}_T].$$

Again, this expectation is with respect to the distribution of $\theta \mid \mathcal{Y}_T$.

The sum of [13.7.5] and [13.7.6] is then proposed as an MSE for the estimate
$\hat{\xi}_{\tau|T}(\hat{\theta})$ around the true value $\xi_\tau$.
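
The Monte Carlo calculation in [13.7.5] and [13.7.6] might be sketched as follows for the scalar AR(1)-plus-noise model. Everything specific here is an assumption made for illustration: the function name, the placeholder data, the hypothetical MLE `phi_hat` and its variance `var_phi_hat`, and the simplification that only $\phi$ is uncertain while the two variances are treated as known.

```python
import numpy as np

def smooth_ar1_plus_noise(phi, sig_v2, sig_w2, y):
    """Scalar Kalman filter and smoother for xi_{t+1} = phi*xi_t + v, y_t = xi_t + w."""
    T = len(y)
    xi_pred, P_pred = np.zeros(T + 1), np.zeros(T + 1)
    xi_filt, P_filt = np.zeros(T), np.zeros(T)
    P_pred[0] = sig_v2 / (1.0 - phi**2)            # unconditional variance of the state
    for t in range(T):
        gain = P_pred[t] / (P_pred[t] + sig_w2)
        xi_filt[t] = xi_pred[t] + gain * (y[t] - xi_pred[t])
        P_filt[t] = P_pred[t] - gain * P_pred[t]
        xi_pred[t + 1] = phi * xi_filt[t]
        P_pred[t + 1] = phi**2 * P_filt[t] + sig_v2
    xi_sm, P_sm = xi_filt.copy(), P_filt.copy()
    for t in range(T - 2, -1, -1):                 # backward pass, [13.6.16] and [13.6.20]
        J = P_filt[t] * phi / P_pred[t + 1]
        xi_sm[t] = xi_filt[t] + J * (xi_sm[t + 1] - xi_pred[t + 1])
        P_sm[t] = P_filt[t] + J * (P_sm[t + 1] - P_pred[t + 1]) * J
    return xi_sm, P_sm

rng = np.random.default_rng(0)
y = rng.normal(size=50)                  # placeholder data, for illustration only
phi_hat, var_phi_hat = 0.6, 0.05**2      # hypothetical MLE and its estimated variance
sig_v2, sig_w2 = 1.0, 0.5                # treated as known to keep the sketch short
n_draws = 2000

xi_hat, _ = smooth_ar1_plus_noise(phi_hat, sig_v2, sig_w2, y)
param_unc = np.zeros(len(y))
filter_unc = np.zeros(len(y))
for phi in rng.normal(phi_hat, np.sqrt(var_phi_hat), size=n_draws):
    xi_j, P_j = smooth_ar1_plus_noise(phi, sig_v2, sig_w2, y)
    param_unc += (xi_j - xi_hat) ** 2 / n_draws    # [13.7.5] for a scalar state
    filter_unc += P_j / n_draws                    # [13.7.6]
total_mse = param_unc + filter_unc
```

With a vector of parameters one would draw from a multivariate normal instead, and with a vector state the squared deviation becomes an outer product, but the structure of the loop is unchanged.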


13.8. Time-Varying Parameters 


State-Space Model with Stochastically Varying Coefficients 


Up to this point we have been assuming that the matrices $F$, $Q$, $A$, $H$, and
$R$ were all constant. The Kalman filter can also be adapted for more general
state-space models in which the values of these matrices depend on the exogenous or
lagged dependent variables included in the vector $x_t$. Consider

$$\xi_{t+1} = F(x_t)\xi_t + v_{t+1}$$ [13.8.1]

$$y_t = a(x_t) + [H(x_t)]'\xi_t + w_t.$$ [13.8.2]


Here $F(x_t)$ denotes an $(r \times r)$ matrix whose elements are functions of $x_t$; $a(x_t)$
similarly describes an $(n \times 1)$ vector-valued function, and $H(x_t)$ an $(r \times n)$
matrix-valued function. It is assumed that conditional on $x_t$ and on data observed through
date $t - 1$, denoted

$$\mathcal{Y}_{t-1} \equiv (y'_{t-1}, y'_{t-2}, \ldots, y'_1, x'_{t-1}, x'_{t-2}, \ldots, x'_1)',$$

the vector $(v'_{t+1}, w'_t)'$ has the Gaussian distribution

$$\left.\begin{bmatrix} v_{t+1} \\ w_t \end{bmatrix}\right| x_t, \mathcal{Y}_{t-1} \sim N\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} Q(x_t) & 0 \\ 0 & R(x_t) \end{bmatrix}\right).$$ [13.8.3]

Note that although [13.8.1] to [13.8.3] generalize the earlier framework by allowing
for stochastically varying parameters, it is more restrictive in that a Gaussian
distribution is assumed in [13.8.3]; the role of the Gaussian requirement will be
explained shortly.

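Suppose it is taken as given that $\xi_t \mid \mathcal{Y}_{t-1} \sim N(\hat{\xi}_{t|t-1}, P_{t|t-1})$. Assuming as
before that $x_t$ contains only strictly exogenous variables or lagged values of $y$, this
also describes the distribution of $\xi_t \mid x_t, \mathcal{Y}_{t-1}$. It follows from the assumptions in
[13.8.1] to [13.8.3] that

$$\left.\begin{bmatrix} \xi_t \\ y_t \end{bmatrix}\right| x_t, \mathcal{Y}_{t-1} \sim N\left(\begin{bmatrix} \hat{\xi}_{t|t-1} \\ a(x_t) + [H(x_t)]'\hat{\xi}_{t|t-1} \end{bmatrix}, \begin{bmatrix} P_{t|t-1} & P_{t|t-1}H(x_t) \\ [H(x_t)]'P_{t|t-1} & [H(x_t)]'P_{t|t-1}H(x_t) + R(x_t) \end{bmatrix}\right).$$ [13.8.4]

Conditional on $x_t$, the terms $a(x_t)$, $H(x_t)$, and $R(x_t)$ can all be treated as
deterministic. Thus, the formula for the conditional distribution of Gaussian vectors
[4.6.7] can be used to deduce that*

$$\xi_t \mid y_t, x_t, \mathcal{Y}_{t-1} \equiv \xi_t \mid \mathcal{Y}_t \sim N(\hat{\xi}_{t|t}, P_{t|t}),$$ [13.8.5]

where

$$\hat{\xi}_{t|t} = \hat{\xi}_{t|t-1} + \{P_{t|t-1}H(x_t)[[H(x_t)]'P_{t|t-1}H(x_t) + R(x_t)]^{-1} \times [y_t - a(x_t) - [H(x_t)]'\hat{\xi}_{t|t-1}]\}$$ [13.8.6]

$$P_{t|t} = P_{t|t-1} - \{P_{t|t-1}H(x_t) \times [[H(x_t)]'P_{t|t-1}H(x_t) + R(x_t)]^{-1}[H(x_t)]'P_{t|t-1}\}.$$ [13.8.7]

It then follows from [13.8.1] and [13.8.3] that $\xi_{t+1} \mid \mathcal{Y}_t \sim N(\hat{\xi}_{t+1|t}, P_{t+1|t})$, where

$$\hat{\xi}_{t+1|t} = F(x_t)\hat{\xi}_{t|t}$$ [13.8.8]

$$P_{t+1|t} = F(x_t)P_{t|t}[F(x_t)]' + Q(x_t).$$ [13.8.9]

*Here $Y_1 = y_t$, $Y_2 = \xi_t$, $\mu_1 = a(x_t) + [H(x_t)]'\hat{\xi}_{t|t-1}$, $\mu_2 = \hat{\xi}_{t|t-1}$,
$\Omega_{11} = \{[H(x_t)]'P_{t|t-1}H(x_t) + R(x_t)\}$, $\Omega_{22} = P_{t|t-1}$, and $\Omega_{21} = P_{t|t-1}H(x_t)$.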

Equations [13.8.6] through [13.8.9] are just the Kalman filter equations [13.2.15],
[13.2.16], [13.2.17], and [13.2.21] with the parameter matrices $F$, $Q$, $A$, $H$, and $R$
replaced by their time-varying analogs. Thus, as long as we are willing to treat the
initial state $\xi_1$ as $N(\hat{\xi}_{1|0}, P_{1|0})$, the Kalman filter iterations go through the same as
before. The obvious generalization of [13.4.1] can continue to be used to evaluate
the likelihood function.

Note, however, that unlike the constant-parameter case, the inference [13.8.6]
is a nonlinear function of $x_t$. This means that although [13.8.6] gives the optimal
inference if the disturbances and initial state are Gaussian, it cannot be interpreted
as the linear projection of $\xi_t$ on $\mathcal{Y}_t$ with non-Gaussian disturbances.


Linear Regression Models with Time-Varying Coefficients 


One important application of the state-space model with stochastically varying
parameters is a regression in which the coefficient vector changes over time.
Consider

$$y_t = x_t'\beta_t + w_t,$$ [13.8.10]

where $x_t$ is a $(k \times 1)$ vector that can include lagged values of $y$ or variables that
are independent of the regression disturbance $w_\tau$ for all $\tau$. The parameters of the
coefficient vector are presumed to evolve over time according to

$$(\beta_{t+1} - \bar{\beta}) = F(\beta_t - \bar{\beta}) + v_{t+1}.$$ [13.8.11]

If the eigenvalues of the $(k \times k)$ matrix $F$ are all inside the unit circle, then $\bar{\beta}$ has
the interpretation as the average or steady-state value for the coefficient vector.

If it is further assumed that

$$\left.\begin{bmatrix} v_{t+1} \\ w_t \end{bmatrix}\right| x_t, \mathcal{Y}_{t-1} \sim N\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} Q & 0 \\ 0' & \sigma^2 \end{bmatrix}\right),$$ [13.8.12]

then [13.8.10] to [13.8.12] will be recognized as a state-space model of the form
of [13.8.1] to [13.8.3] with state vector $\xi_t = \beta_t - \bar{\beta}$. The regression in [13.8.10]
can be written as

$$y_t = x_t'\bar{\beta} + x_t'\xi_t + w_t,$$ [13.8.13]

which is an observation equation of the form of [13.8.2] with $a(x_t) = x_t'\bar{\beta}$,
$H(x_t) = x_t$, and $R(x_t) = \sigma^2$. These values are then used in the Kalman filter
iterations [13.8.6] to [13.8.9]. A one-period-ahead forecast for [13.8.10] can be
calculated from [13.8.4] as

$$E(y_t \mid x_t, \mathcal{Y}_{t-1}) = x_t'\bar{\beta} + x_t'\hat{\xi}_{t|t-1},$$

where $\{\hat{\xi}_{t|t-1}\}_{t=1}^{T}$ is calculated from [13.8.6] and [13.8.8]. The MSE of this forecast
can also be inferred from [13.8.4]:

$$E[(y_t - x_t'\bar{\beta} - x_t'\hat{\xi}_{t|t-1})^2 \mid x_t, \mathcal{Y}_{t-1}] = x_t'P_{t|t-1}x_t + \sigma^2,$$

where $\{P_{t|t-1}\}_{t=1}^{T}$ is calculated from [13.8.7] and [13.8.9]. The sample log likelihood
is therefore

$$\sum_{t=1}^{T} \log f(y_t \mid x_t, \mathcal{Y}_{t-1}) = -(T/2)\log(2\pi) - (1/2)\sum_{t=1}^{T} \log(x_t'P_{t|t-1}x_t + \sigma^2) - (1/2)\sum_{t=1}^{T} (y_t - x_t'\bar{\beta} - x_t'\hat{\xi}_{t|t-1})^2/(x_t'P_{t|t-1}x_t + \sigma^2).$$
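
The filter and log likelihood for the time-varying-coefficient regression can be sketched as below. This is an illustration under stated assumptions, not a definitive implementation: the function name and argument names are hypothetical, and the recursion starts from $\hat{\xi}_{1|0} = 0$ with a user-supplied $P_{1|0}$.

```python
import numpy as np

def tvp_regression_loglik(y, X, beta_bar, F, Q, sigma2, P_init):
    """Log likelihood and filtered coefficients for y_t = x_t' beta_t + w_t.

    The state is xi_t = beta_t - beta_bar; X has shape (T, k) with rows x_t'.
    """
    T, k = X.shape
    xi_pred, P_pred = np.zeros(k), P_init.copy()
    loglik = 0.0
    beta_filt = np.zeros((T, k))
    for t in range(T):
        x = X[t]
        err = y[t] - x @ beta_bar - x @ xi_pred         # one-step forecast error
        mse = x @ P_pred @ x + sigma2                   # its MSE
        loglik += -0.5 * (np.log(2 * np.pi) + np.log(mse) + err**2 / mse)
        gain = P_pred @ x / mse
        xi_filt = xi_pred + gain * err                  # [13.8.6]
        P_filt = P_pred - np.outer(gain, x @ P_pred)    # [13.8.7]
        beta_filt[t] = beta_bar + xi_filt
        xi_pred = F @ xi_filt                           # [13.8.8]
        P_pred = F @ P_filt @ F.T + Q                   # [13.8.9]
    return loglik, beta_filt
```

In practice the function would be handed to a numerical optimizer over the unknown elements of `beta_bar`, `F`, `Q`, and `sigma2`. When `Q` and `P_init` are both zero, the coefficients never move and the value reduces to the ordinary Gaussian regression log likelihood.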


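The specification in [13.8.11] can easily be generalized to allow for a $p$th-order
VAR for the coefficient vector $\beta_t$ by defining
$\xi_t' \equiv [(\beta_t - \bar{\beta})', (\beta_{t-1} - \bar{\beta})', \ldots, (\beta_{t-p+1} - \bar{\beta})']$ and replacing [13.8.11] with

$$\xi_{t+1} = \begin{bmatrix} \Phi_1 & \Phi_2 & \cdots & \Phi_{p-1} & \Phi_p \\ I_k & 0 & \cdots & 0 & 0 \\ 0 & I_k & \cdots & 0 & 0 \\ \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & \cdots & I_k & 0 \end{bmatrix} \xi_t + \begin{bmatrix} v_{t+1} \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}.$$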


Estimation of a VAR with Time-Varying Coefficients 


Section 12.2 described Litterman’s approach to Bayesian estimation of an 
equation of a vector autoregression with constant but unknown coefficients. A 
related approach to estimating a VAR with time-varying coefficients was developed 
by Doan, Litterman, and Sims (1984). Although efficiency might be improved by 
estimating all the equations of the VAR jointly, their proposal was to infer the 
parameters for each equation in isolation from the others. 

Suppose for illustration that equation [13.8.10] describes the first equation
from a VAR, so that the dependent variable ($y_t$) is $y_{1t}$ and the $(k \times 1)$ vector of
explanatory variables is $x_t = (1, y'_{t-1}, y'_{t-2}, \ldots, y'_{t-p})'$, where $y_t = (y_{1t}, y_{2t}, \ldots,
y_{nt})'$ and $k = np + 1$. The coefficient vector is

$$\beta_t = (c_{1,t}, \phi^{(1)}_{11,t}, \phi^{(1)}_{12,t}, \ldots, \phi^{(1)}_{1n,t}, \phi^{(2)}_{11,t}, \phi^{(2)}_{12,t}, \ldots, \phi^{(2)}_{1n,t}, \ldots, \phi^{(p)}_{11,t}, \phi^{(p)}_{12,t}, \ldots, \phi^{(p)}_{1n,t})',$$

where $\phi^{(s)}_{1j,t}$ is the coefficient relating $y_{1t}$ to $y_{j,t-s}$. This coefficient is allowed to be
different for each date $t$ in the sample.
Doan, Litterman, and Sims specified a Bayesian prior distribution for the
initial value of the coefficient vector at date 1:

$$\beta_1 \sim N(\bar{\beta}, P_{1|0}).$$ [13.8.14]

The prior distribution is independent across coefficients, so that $P_{1|0}$ is a diagonal
matrix. The mean of the prior distribution, $\bar{\beta}$, is that used by Litterman (1986) for
a constant-coefficient VAR. This prior distribution holds that changes in $y_{1t}$ are
probably difficult to forecast, so that the coefficient on $y_{1,t-1}$ is likely to be near
unity and all other coefficients are expected to be near zero:

$$\bar{\beta} = (0, 1, 0, 0, \ldots, 0)'.$$ [13.8.15]

As in Section 12.2, let $\gamma$ characterize the analyst's confidence in the prediction that
$\phi^{(1)}_{11,1}$ is near unity:

$$\phi^{(1)}_{11,1} \sim N(1, \gamma^2).$$

Smaller values of $\gamma$ imply more confidence in the prior conviction that $\phi^{(1)}_{11,1}$ is near
unity.
The coefficient $\phi^{(s)}_{11,1}$ relates the value of variable 1 at date 1 to its own value
$s$ periods earlier. Doan, Litterman, and Sims had more confidence in the prior
conviction that $\phi^{(s)}_{11,1}$ is zero the greater the lag, or the larger the value of $s$. They
represented this with a harmonic series for the variance,

$$\phi^{(s)}_{11,1} \sim N(0, \gamma^2/s) \quad \text{for } s = 2, 3, \ldots, p.$$

The prior distribution for the coefficient relating variable 1 to lags of other
variables was taken to be

$$\phi^{(s)}_{1j,1} \sim N\left(0, \frac{w^2 \cdot \gamma^2 \cdot \hat{\tau}_1^2}{s \cdot \hat{\tau}_j^2}\right) \quad j = 2, 3, \ldots, n; \; s = 1, 2, \ldots, p.$$ [13.8.16]

As in expression [12.2.4], this includes a correction $(\hat{\tau}_1^2/\hat{\tau}_j^2)$ for the scale of $y_{1t}$
relative to $y_{jt}$, where $\hat{\tau}_j^2$ is the estimated variance of the residuals for a univariate
fixed-coefficient AR($p$) process fitted to series $j$. The variance in [13.8.16] also
includes a factor $w^2 < 1$ representing the prior expectation that lagged values of
$y_j$ for $j \neq 1$ are less likely to be of help in forecasting $y_1$ than would be the lagged
values of $y_1$ itself; hence, a tighter prior is used to set coefficients on $y_j$ to zero.

Finally, let $g$ describe the variance of the prior distribution for the constant
term:

$$c_{1,1} \sim N(0, g \cdot \hat{\tau}_1^2).$$
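To summarize, the matrix $P_{1|0}$ is specified to be

$$P_{1|0} = \begin{bmatrix} g \cdot \hat{\tau}_1^2 & 0' \\ 0 & (B \otimes C) \end{bmatrix},$$ [13.8.17]

where

$$\underset{(p \times p)}{B} = \begin{bmatrix} \gamma^2 & 0 & 0 & \cdots & 0 \\ 0 & \gamma^2/2 & 0 & \cdots & 0 \\ 0 & 0 & \gamma^2/3 & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & 0 & \cdots & \gamma^2/p \end{bmatrix}$$

$$\underset{(n \times n)}{C} = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 0 & w^2\hat{\tau}_1^2/\hat{\tau}_2^2 & 0 & \cdots & 0 \\ 0 & 0 & w^2\hat{\tau}_1^2/\hat{\tau}_3^2 & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & 0 & \cdots & w^2\hat{\tau}_1^2/\hat{\tau}_n^2 \end{bmatrix}.$$

For typical economic time series, Doan, Litterman, and Sims recommended using
$\gamma^2 = 0.07$, $w^2 = 1/74$, and $g = 630$. This last value ensures that very little weight
is given to the prior expectation that the constant term is zero.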
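
The prior variance matrix $P_{1|0}$ in [13.8.17] can be assembled with a Kronecker product. The sketch below is illustrative: the function name `dls_prior` is hypothetical, `tau2` stands for the vector of estimated residual variances from univariate AR($p$) fits (which must be supplied by the user), and the defaults are the values quoted above.

```python
import numpy as np

def dls_prior(tau2, p, gamma2=0.07, w2=1.0 / 74.0, g=630.0):
    """Return (beta_bar, P_init) for the first equation of an n-variable VAR(p)."""
    tau2 = np.asarray(tau2, dtype=float)
    n = tau2.size
    k = n * p + 1
    B = np.diag(gamma2 / np.arange(1, p + 1))      # harmonic decay across lags
    c = w2 * tau2[0] / tau2                        # scale-corrected cross-variable terms
    c[0] = 1.0                                     # own lags are not shrunk by w2
    C = np.diag(c)
    P_init = np.zeros((k, k))
    P_init[0, 0] = g * tau2[0]                     # loose prior on the constant term
    P_init[1:, 1:] = np.kron(B, C)                 # ordered lag by lag, variable by variable
    beta_bar = np.zeros(k)
    beta_bar[1] = 1.0                              # prior mean [13.8.15]
    return beta_bar, P_init

# Hypothetical residual variances for a two-variable system with two lags.
beta_bar, P_init = dls_prior(tau2=[1.0, 4.0], p=2)
```

The ordering produced by `np.kron(B, C)` matches the ordering of the coefficient vector $\beta_t$: all variables at lag 1, then all variables at lag 2, and so on.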

Each of the coefficients in the VAR is then presumed to evolve over time
according to a first-order autoregression:

$$\beta_{t+1} = \pi_8 \cdot \beta_t + (1 - \pi_8) \cdot \bar{\beta} + v_{t+1}.$$ [13.8.18]

Thus, the same scalar $\pi_8$ is used to describe a univariate AR(1) process for each
element of $\beta_t$; Doan, Litterman, and Sims recommended a value of $\pi_8 = 0.999$.
The disturbance $v_t$ is assumed to have a diagonal variance-covariance matrix:

$$E(v_t v_t') = Q.$$ [13.8.19]

For all coefficients except the constant term, the variance of the $i$th element of $v_t$
was assumed to be proportional to the corresponding element of $P_{1|0}$. Thus, for
$i = 2, 3, \ldots, k$, the row $i$, column $i$ element of $Q$ is taken to be $\pi_7$ times the row
$i$, column $i$ element of $P_{1|0}$. The (1, 1) element of $Q$ is taken to be $\pi_7$ times the
(2, 2) element of $P_{1|0}$. This adjustment is used because the (1, 1) element of $P_{1|0}$
represents an effectively infinite variance corresponding to prior ignorance about
the value for the constant term. Doan, Litterman, and Sims recommended $\pi_7 =
10^{-7}$ as a suitable value for the constant of proportionality.
Equation [13.8.18] can be viewed as a state equation of the form

$$\xi_{t+1} = F\xi_t + v_{t+1},$$ [13.8.20]

where the state vector is given by $\xi_t = (\beta_t - \bar{\beta})$ and $F = \pi_8 \cdot I_k$. The observation
equation is

$$y_{1t} = x_t'\bar{\beta} + x_t'\xi_t + w_{1t}.$$ [13.8.21]

The one parameter yet to be specified is the variance of $w_{1t}$, the residual in the
VAR. Doan, Litterman, and Sims suggested taking this to be 0.9 times $\hat{\tau}_1^2$.

Thus, the sequence of estimated state vectors $\{\hat{\xi}_{t|t}\}_{t=1}^{T}$ is found by iterating
on [13.8.6] through [13.8.9] starting from $\hat{\xi}_{1|0} = 0$ and $P_{1|0}$ given by [13.8.17], with
$F(x_t) = \pi_8 \cdot I_k$, $Q(x_t) = \pi_7 \cdot P_{1|0}$ (with the (1, 1) element adjusted as described
above), $a(x_t) = x_t'\bar{\beta}$ with $\bar{\beta}$ given by [13.8.15], $H(x_t) = x_t$, and
$R(x_t) = 0.9 \cdot \hat{\tau}_1^2$. The estimated coefficient vector is then $\hat{\beta}_{t|t} = \bar{\beta} + \hat{\xi}_{t|t}$.
Optimal one-period-ahead forecasts are given by $\hat{y}_{1,t+1|t} = x'_{t+1}\hat{\beta}_{t|t}$.

Optimal $s$-period-ahead forecasts are difficult to calculate. However, Doan,
Litterman, and Sims suggested a simple approximation. The approximation takes
the optimal one-period-ahead forecasts for each of the $n$ variables in the VAR,
$\hat{y}_{t+1|t}$, and then treats these forecasts as if they were actual observations on $y_{t+1}$.
Then $E(y_{t+2} \mid y_t, y_{t-1}, \ldots, y_1)$ is approximated by $E(y_{t+2} \mid y_{t+1}, y_t, \ldots, y_1)$
evaluated at $y_{t+1} = E(y_{t+1} \mid y_t, y_{t-1}, \ldots, y_1)$. The law of iterated expectations does
not apply here, since $E(y_{t+2} \mid y_{t+1}, y_t, \ldots, y_1)$ is a nonlinear function of $y_{t+1}$.
However, Doan, Litterman, and Sims argued that this simple approach gives a
good approximation to the optimal forecast.


APPENDIX 13.A. Proofs of Chapter 13 Propositions 


■ Proof of Proposition 13.1.* Recall that $P_{t+1|t}$ has the interpretation as the MSE of the
linear projection of $\xi_{t+1}$ on $\mathcal{Y}_t \equiv (y'_t, y'_{t-1}, \ldots, y'_1, x'_t, x'_{t-1}, \ldots, x'_1)'$,

$$P_{t+1|t} \equiv \text{MSE}[\hat{E}(\xi_{t+1} \mid \mathcal{Y}_t)].$$ [13.A.1]

Suppose for some reason we instead tried to forecast $\xi_{t+1}$ using only observations $2, 3, \ldots,
t$, discarding the observation for date $t = 1$. Thus, define
$\mathcal{Y}^*_t \equiv (y'_t, y'_{t-1}, \ldots, y'_2, x'_t, x'_{t-1}, \ldots, x'_2)'$, and let

$$P^*_{t+1|t} \equiv \text{MSE}[\hat{E}(\xi_{t+1} \mid \mathcal{Y}^*_t)].$$ [13.A.2]

Then clearly, [13.A.2] cannot be smaller than [13.A.1], since the linear projection
$\hat{E}(\xi_{t+1} \mid \mathcal{Y}_t)$ made optimal use of $\mathcal{Y}^*_t$ along with the added information in $(y'_1, x'_1)'$.
Specifically, if $h$ is any $(r \times 1)$ vector, the linear projection of $z_{t+1} \equiv h'\xi_{t+1}$ on $\mathcal{Y}_t$ has MSE given
by

$$E[z_{t+1} - \hat{E}(z_{t+1} \mid \mathcal{Y}_t)]^2 = E[h'\xi_{t+1} - h' \cdot \hat{E}(\xi_{t+1} \mid \mathcal{Y}_t)]^2$$
$$= h' \cdot E\{[\xi_{t+1} - \hat{E}(\xi_{t+1} \mid \mathcal{Y}_t)][\xi_{t+1} - \hat{E}(\xi_{t+1} \mid \mathcal{Y}_t)]'\} \cdot h$$
$$= h'P_{t+1|t}h.$$

Similarly, the linear projection of $z_{t+1}$ on $\mathcal{Y}^*_t$ has MSE $h'P^*_{t+1|t}h$, with

$$h'P_{t+1|t}h \leq h'P^*_{t+1|t}h.$$ [13.A.3]

But for a system of the form of [13.2.1] and [13.2.2] with eigenvalues of $F$ inside the unit
circle and time-invariant coefficients, it will be the case that

$$\text{MSE}[\hat{E}(\xi_{t+1} \mid y_t, y_{t-1}, \ldots, y_2, x_t, x_{t-1}, \ldots, x_2)] = \text{MSE}[\hat{E}(\xi_t \mid y_{t-1}, y_{t-2}, \ldots, y_1, x_{t-1}, x_{t-2}, \ldots, x_1)],$$

that is,

$$P^*_{t+1|t} = P_{t|t-1}.$$

Hence, [13.A.3] implies that

$$h'P_{t+1|t}h \leq h'P_{t|t-1}h$$

for any $(r \times 1)$ vector $h$. The sequence of scalars $\{h'P_{t+1|t}h\}_{t=1}^{\infty}$ is thus monotonically
nonincreasing and is bounded below by zero. It therefore converges to some fixed
nonnegative value. Since this is true for any $(r \times 1)$ vector $h$ and since the matrix $P_{t+1|t}$ is
symmetric, it follows that the sequence $\{P_{t+1|t}\}_{t=1}^{\infty}$ converges to some fixed positive
semidefinite matrix $P$.

*The arguments in the proofs of Propositions 13.1 and 13.2 are adapted from Anderson and Moore
(1979, pp. 76-82).

To verify the claims about the eigenvalues of the matrix $(F - KH')$, note that if $P$ is
a fixed point of [13.5.3], then it must also be a fixed point of the equivalent difference
equation [13.2.28]:

$$P = (F - KH')P(F - KH')' + KRK' + Q.$$ [13.A.4]

Let $x$ denote an eigenvector of $(F - KH')'$ and $\lambda$ its eigenvalue:

$$(F - KH')'x = \lambda x.$$ [13.A.5]

Although $F$, $K$, and $H$ are all real, the eigenvalue $\lambda$ and eigenvector $x$ could be complex.
If $x^H$ denotes the conjugate transpose of $x$, then

$$x^H(F - KH')P(F - KH')'x = [(F - KH')'x]^H P[(F - KH')'x] = [\lambda x]^H P[\lambda x] = |\lambda|^2 x^H P x.$$

Thus, if [13.A.4] is premultiplied by $x^H$ and postmultiplied by $x$, the result is

$$x^H P x = |\lambda|^2 x^H P x + x^H(KRK' + Q)x,$$

or

$$(1 - |\lambda|^2)x^H P x = x^H(KRK' + Q)x.$$ [13.A.6]

Now, $(KRK' + Q)$ is positive semidefinite, so the right side of [13.A.6] is nonnegative.
Likewise, $P$ is positive semidefinite, so $x^H P x$ is nonnegative. Expression [13.A.6] then
requires $|\lambda| \leq 1$, meaning that any eigenvalue of $(F - KH')$ must be on or inside the unit
circle, as claimed. ■


■ Proof of Proposition 13.2. First we establish the final claim of the proposition, concerning
the eigenvalues of $(F - KH')$. Let $P$ denote any positive semidefinite matrix that satisfies
[13.A.4], and let $K$ be given by [13.5.4]. Notice that if $Q$ is positive definite, then the right
side of [13.A.6] is strictly positive for any nonzero $x$, meaning from the left side of [13.A.6]
that any eigenvalue $\lambda$ of $(F - KH')$ is strictly inside the unit circle. Alternatively, if $R$ is
positive definite, then the only way that the right side of [13.A.6] could fail to be strictly
positive would be if $K'x = 0$. But from [13.A.5], this would imply that $F'x = \lambda x$, that is,
that $x$ is an eigenvector and $\lambda$ is an eigenvalue of $F'$. This, in turn, means that $\lambda$ is an
eigenvalue of $F$, in which case $|\lambda| < 1$, by the assumption of stability of $F$. Thus, there
cannot be an eigenvector $x$ of $(F - KH')'$ associated with an eigenvalue whose modulus is
greater than or equal to unity if $R$ is positive definite.

Turning next to the rest of Proposition 13.2, let $\{P_{t+1|t}\}$ denote the sequence that
results from iterating on [13.5.1] starting from an arbitrary positive semidefinite initial value
$P_{1|0}$. We will show that there exist two other sequences of matrices, to be denoted
$\{\underline{P}_{t+1|t}\}$ and $\{\tilde{P}_{t+1|t}\}$, such that

$$\underline{P}_{t+1|t} \leq P_{t+1|t} \leq \tilde{P}_{t+1|t} \quad \text{for all } t,$$

where

$$\lim_{t \to \infty} \underline{P}_{t+1|t} = \lim_{t \to \infty} \tilde{P}_{t+1|t} = P$$

and where $P$ does not depend on $P_{1|0}$. The conclusion will then be that $\{P_{t+1|t}\}$ converges
to $P$ regardless of the value of $P_{1|0}$.

To construct the matrix $\underline{P}_{t+1|t}$ that is to be offered as a lower bound on $P_{t+1|t}$,
consider the sequence $\{\underline{P}_{t+1|t}\}$ that results from iterating on [13.5.1] starting from the
initial value $\underline{P}_{1|0} = 0$. This would correspond to treating the initial state $\xi_1$ as if known
with certainty:

$$\underline{P}_{t+1|t} = \text{MSE}[\hat{E}(\xi_{t+1} \mid \mathcal{Y}_t, \xi_1)].$$ [13.A.7]

Note that $y_1$ and $x_1$ are correlated with $\xi_{t+1}$ for $t = 1, 2, \ldots$ only through the value of $\xi_1$,
which means that we could equally well write

$$\underline{P}_{t+1|t} = \text{MSE}[\hat{E}(\xi_{t+1} \mid \mathcal{Y}^*_t, \xi_1)],$$ [13.A.8]

where $\mathcal{Y}^*_t \equiv (y'_t, y'_{t-1}, \ldots, y'_2, x'_t, x'_{t-1}, \ldots, x'_2)'$. Added knowledge about $\xi_2$ could not
hurt the forecast:

$$\text{MSE}[\hat{E}(\xi_{t+1} \mid \mathcal{Y}^*_t, \xi_2, \xi_1)] \leq \text{MSE}[\hat{E}(\xi_{t+1} \mid \mathcal{Y}^*_t, \xi_1)],$$ [13.A.9]


404 Chapter 13 | The Kalman Filter 


and indeed, &, is correlated with &,,, fort = 2, 3,. . . only through the value of &,: 


MSE[E (19%, & 0] = MSE(E(E,..197*, &)]- [13.A. 10] 
Because coefficients are time-invariant, 
MSE(E(E,..197, &)] = MSE(E(EIY,-1, €:)] = Bare (13.A.11] 


Thus, [13.A.10] and [13.A.11] establish that the left side of [13.A.9] is the same as P,,, ,, 


while from [13.A.8] the right side of [13.A.9] is the same as P,, ,,. Thus, [13.A.9] states 
that 


Pi, = mites 


so that {P,,),} is a monotonically nondecreasing sequence; the farther in the past is the 
perfect information about &,, the less value it is for forecasting &,.,. 
Furthermore, a forecast based on perfect information about €,, for which P,, ,,, 


gives the MSE, must be better than one based on imperfect information about &,, for which 
P,, 1, gives the MSE: 


P 


= 
mete 


+ tle for all t. 


Thus. P,,,), puts a lower bound on P,,,),. aS claimed. Moreover, since thé sequence 
{P,, yb is monotonically nondecreasing and bounded from above, it converges to a fixed 
value P satisfying [13.5.3] and [13.A.4]. in 
To construct an upper bound on $P_{t+1|t}$, consider a sequence $\{\tilde{P}_{t+1|t}\}$ that begins with
$\tilde{P}_{1|0} = P_{1|0}$, the same starting value that was used to construct $\{P_{t+1|t}\}$. Recall that $P_{t+1|t}$
gave the MSE of the sequence $\hat{\xi}_{t+1|t}$ described in equation [13.2.20]:

$\hat{\xi}_{t+1|t} = F\hat{\xi}_{t|t-1} + K_t(y_t - A'x_t - H'\hat{\xi}_{t|t-1}).$

Imagine instead using a sequence of suboptimal inferences $\{\tilde{\xi}_{t+1|t}\}$ defined by the recursion

$\tilde{\xi}_{t+1|t} = F\tilde{\xi}_{t|t-1} + K(y_t - A'x_t - H'\tilde{\xi}_{t|t-1}),$ [13.A.12]

where $K$ is the value calculated from [13.5.4] in which the steady-state value for $P$ is taken
to be the limit of the sequence $\{\underline{P}_{t+1|t}\}$. Note that the magnitude $\tilde{\xi}_{t+1|t}$ so defined is a
linear function of $\mathcal{Y}_t$ and so must have a greater MSE than the optimal inference $\hat{\xi}_{t+1|t}$:

$\tilde{P}_{t+1|t} \equiv E[(\xi_{t+1} - \tilde{\xi}_{t+1|t})(\xi_{t+1} - \tilde{\xi}_{t+1|t})'] \ge P_{t+1|t}.$

Thus, we have established that

$\underline{P}_{t+1|t} \le P_{t+1|t} \le \tilde{P}_{t+1|t}$

and that $\underline{P}_{t+1|t} \to P$. The proof will be complete if we can further show that $\tilde{P}_{t+1|t} \to P$.
Parallel calculations to those leading to [13.2.28] reveal that

$\tilde{P}_{t+1|t} = (F - KH')\tilde{P}_{t|t-1}(F - KH')' + KRK' + Q.$ [13.A.13]

Apply the vec operator to both sides of [13.A.13] and recall Proposition 10.4:

$\mathrm{vec}(\tilde{P}_{t+1|t}) = \mathcal{B}\,\mathrm{vec}(\tilde{P}_{t|t-1}) + c = [I_{r^2} + \mathcal{B} + \mathcal{B}^2 + \cdots + \mathcal{B}^{t-1}]c + \mathcal{B}^t\,\mathrm{vec}(\tilde{P}_{1|0}),$

where

$\mathcal{B} \equiv (F - KH') \otimes (F - KH')$
$c \equiv \mathrm{vec}(KRK' + Q).$

Recall further that since either $R$ or $Q$ is positive definite, the value of $K$ has the
property that all eigenvalues of $(F - KH')$ are strictly less than unity in modulus. Thus, all
eigenvalues of $\mathcal{B}$ are also strictly less than unity in modulus, implying that

$\lim_{t\to\infty} \mathrm{vec}(\tilde{P}_{t+1|t}) = (I_{r^2} - \mathcal{B})^{-1}c,$

the same value regardless of the starting value for $\tilde{P}_{1|0}$. In particular, if the iteration on
[13.A.13] is started with $\tilde{P}_{1|0} = P$, this being a fixed point of the iteration, the result would
be $\tilde{P}_{t+1|t} = P$ for all $t$. Thus,

$\lim_{t\to\infty} \tilde{P}_{t+1|t} = P,$

regardless of the value of $\tilde{P}_{1|0} = P_{1|0}$ from which the iteration for $\tilde{P}_{t+1|t}$ is started. ∎
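
The convergence claim in Proposition 13.2 can be illustrated numerically. The following is only a sketch for the scalar case (one state variable, one observed variable), with parameter values chosen arbitrarily for illustration; the function names are hypothetical and not part of the text. It iterates the scalar form of [13.5.1] from two different starting values and checks that both paths settle on the same fixed point.

```python
def riccati_step(p, f, h, q, r):
    """Scalar version of [13.5.1]: map P_{t|t-1} into P_{t+1|t}."""
    gain_term = p * h / (h * p * h + r)      # P H (H'PH + R)^{-1}
    return f * (p - gain_term * h * p) * f + q

def iterate(p0, f, h, q, r, steps=200):
    """Return [P_{1|0}, P_{2|1}, ...] from repeated application of riccati_step."""
    path = [p0]
    for _ in range(steps):
        path.append(riccati_step(path[-1], f, h, q, r))
    return path

# Hypothetical parameter values: |f| < 1 (stable F), with Q > 0 and R > 0.
f, h, q, r = 0.9, 1.0, 1.0, 2.0

lower = iterate(0.0, f, h, q, r)     # start from P_{1|0} = 0, the lower-bound sequence
other = iterate(50.0, f, h, q, r)    # start from an arbitrary positive value

p_star = lower[-1]
k_star = f * p_star * h / (h * p_star * h + r)   # steady-state gain, as in [13.5.4]

print(p_star, other[-1])             # both paths end at (numerically) the same value
print(abs(f - k_star * h))           # modulus of the single eigenvalue of (F - KH')
```

In this scalar setting the sequence started from zero is nondecreasing and lies below the sequence started from any other nonnegative value, which mirrors the role of the lower-bound sequence in the proof.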




■ Proof of Proposition 13.3. Observe that

$\{I_n + H'(I_r - Fz)^{-1}Kz\}\{I_n - H'[I_r - (F - KH')z]^{-1}Kz\}$
$\quad = I_n - H'[I_r - (F - KH')z]^{-1}Kz + H'(I_r - Fz)^{-1}Kz$
$\qquad - \{H'(I_r - Fz)^{-1}Kz\}\{H'[I_r - (F - KH')z]^{-1}Kz\}$ [13.A.14]
$\quad = I_n + H'\{-[I_r - (F - KH')z]^{-1} + [I_r - Fz]^{-1}$
$\qquad - [I_r - Fz]^{-1}KH'z[I_r - (F - KH')z]^{-1}\}Kz.$

The term in curly braces in the last line of [13.A.14] is indeed zero, as may be verified by
taking the identity

$-[I_r - Fz] + [I_r - (F - KH')z] - KH'z = 0$

and premultiplying by $[I_r - Fz]^{-1}$ and postmultiplying by $[I_r - (F - KH')z]^{-1}$:

$-[I_r - (F - KH')z]^{-1} + [I_r - Fz]^{-1} - [I_r - Fz]^{-1}KH'z[I_r - (F - KH')z]^{-1} = 0.$ [13.A.15] ∎


■ Proof of Proposition 13.4. Notice that

$\{I_n + H'(I_r - Fz)^{-1}Kz\}\{H'PH + R\}\{I_n + K'(I_r - F'z^{-1})^{-1}Hz^{-1}\}$
$\quad = \{H'PH + R\} + H'(I_r - Fz)^{-1}K\{H'PH + R\}z$ [13.A.16]
$\qquad + \{H'PH + R\}K'(I_r - F'z^{-1})^{-1}Hz^{-1}$
$\qquad + H'(I_r - Fz)^{-1}K\{H'PH + R\}K'(I_r - F'z^{-1})^{-1}H.$

Now [13.5.4] requires that

$K\{H'PH + R\} = FPH$ [13.A.17]
$\{H'PH + R\}K' = H'PF'$ [13.A.18]
$K\{H'PH + R\}K' = FPH\{H'PH + R\}^{-1}H'PF' = FPF' - P + Q,$ [13.A.19]

with the last equality following from [13.5.3]. Substituting [13.A.17] through [13.A.19] into
[13.A.16] results in

$\{I_n + H'(I_r - Fz)^{-1}Kz\}\{H'PH + R\}\{I_n + K'(I_r - F'z^{-1})^{-1}Hz^{-1}\}$
$\quad = \{H'PH + R\} + H'(I_r - Fz)^{-1}FPHz + H'PF'(I_r - F'z^{-1})^{-1}Hz^{-1}$
$\qquad + H'(I_r - Fz)^{-1}\{FPF' - P + Q\}(I_r - F'z^{-1})^{-1}H$ [13.A.20]
$\quad = R + H'\{P + (I_r - Fz)^{-1}FPz + PF'(I_r - F'z^{-1})^{-1}z^{-1}$
$\qquad + (I_r - Fz)^{-1}\{FPF' - P + Q\}(I_r - F'z^{-1})^{-1}\}H.$

The result in Proposition 13.4 follows provided that

$P + (I_r - Fz)^{-1}FPz + PF'(I_r - F'z^{-1})^{-1}z^{-1} + (I_r - Fz)^{-1}\{FPF' - P\}(I_r - F'z^{-1})^{-1} = 0.$ [13.A.21]

To verify that [13.A.21] is true, start from the identity

$(I_r - Fz)P(I_r - F'z^{-1}) + FPz(I_r - F'z^{-1}) + (I_r - Fz)PF'z^{-1} + FPF' - P = 0.$ [13.A.22]

Premultiplying [13.A.22] by $(I_r - Fz)^{-1}$ and postmultiplying by $(I_r - F'z^{-1})^{-1}$ confirms
[13.A.21]. Substituting [13.A.21] into [13.A.20] produces the claim in Proposition 13.4. ∎


Chapter 13 Exercises 
13.1. Suppose we have a noisy indicator $y$ on an underlying unobserved random variable $\xi$:

$y = \xi + \varepsilon.$

Suppose moreover that the measurement error ($\varepsilon$) is $N(0, \tau^2)$, while the true value $\xi$ is
$N(\mu, \sigma^2)$, with $\varepsilon$ uncorrelated with $\xi$. Show that the optimal estimate of $\xi$ is given by

$E(\xi | y) = \mu + \dfrac{\sigma^2}{\tau^2 + \sigma^2}(y - \mu)$

with associated MSE

$E[\xi - E(\xi | y)]^2 = \dfrac{\sigma^2\tau^2}{\tau^2 + \sigma^2}.$

Discuss the intuition for these results as $\tau^2 \to \infty$ and $\tau^2 \to 0$.


13.2. Deduce the state-space representation for an AR(p) model in [13.1.14] and [13.1.15] 
and the state-space representation for an MA(1) model given in [13.1.17] and [13.1.18] as 
special cases of that for the ARMA(r, r — 1) model of [13.1.22] and [13.1.23]. 


13.3. Is the following a valid state-space representation of an MA(1) process? 


State Equation: 
E41) _ [0 0 &, E,at 
e]-[ lle) +L] 


y-H=(l al "| 


13.4. Derive equation [13.4.5] as a special case of [13.4.1] and [13.4.2] for the model 
specified in [13.4.3] and [13.4.4] by analysis of the Kalman filter recursion for this case. 
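13.5. Consider a particular MA(1) representation of the form of [13.3.1] through [13.3.12]
parameterized by $(\theta, \sigma^2)$ with $|\theta| < 1$. The noninvertible representation for the same process
is parameterized by $(\tilde{\theta}, \tilde{\sigma}^2)$ with $\tilde{\theta} = 1/\theta$ and $\tilde{\sigma}^2 = \theta^2\sigma^2$. The forecast generated by the
Kalman filter using the noninvertible representation satisfies

$\hat{y}_{t+1|t} = A'x_{t+1} + H'\hat{\xi}_{t+1|t} = \mu + \tilde{\theta}\tilde{\varepsilon}_{t|t},$

where $\tilde{\varepsilon}_{t|t} = \{\tilde{\sigma}^2/[\tilde{\sigma}^2 + \tilde{\theta}^2\tilde{p}_t]\}\cdot\{y_t - \mu - \tilde{\theta}\tilde{\varepsilon}_{t-1|t-1}\}$. The MSE of this forecast is

$E(y_{t+1} - \hat{y}_{t+1|t})^2 = H'P_{t+1|t}H + R = \tilde{\sigma}^2 + \tilde{\theta}^2\tilde{p}_{t+1},$

where $\tilde{p}_{t+1} = (\tilde{\sigma}^2\tilde{\theta}^{2t})/(1 + \tilde{\theta}^2 + \tilde{\theta}^4 + \cdots + \tilde{\theta}^{2t})$. Show that this forecast and MSE are
identical to those for the process as parameterized using the invertible representation
$(\theta, \sigma^2)$. Deduce that the likelihood function given by [13.4.1] and [13.4.2] takes on the same
value at $(\theta, \sigma^2)$ as it does at $(\tilde{\theta}, \tilde{\sigma}^2)$.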
13.6. Show that $\varepsilon_t$ in equation [13.5.22] is fundamental for $y_t$. What principle of the Kalman
filter ensures that this will be the case? Show that the first autocovariance of the implied
MA(1) error process is given by

$-(\phi - K)E(\varepsilon_t^2) = -\phi\sigma_W^2,$


Observation Equation: 


while the variance is

$[1 + (\phi - K)^2]E(\varepsilon_t^2) = (1 + \phi^2)\sigma_W^2 + \sigma_V^2.$

Derive these expressions independently, using the approach to sums of ARMA processes
in Section 4.7.


13.7. Consider again the invertible MA(1) of equations [13.3.1] to [13.3.12]. We found
that the steady-state value of $P_{t|t-1}$ is given by

$P = \begin{bmatrix} \sigma^2 & 0 \\ 0 & 0 \end{bmatrix}.$

From this, deduce that the steady-state value of $P_{t|t+s} = 0$ for $s = 0, 1, 2, \ldots$. Give the
intuition for this result.


Chapter 13 References 


Anderson, Brian D. O., and John B. Moore. 1979. Optimal Filtering. Englewood Cliffs,
N.J.: Prentice-Hall.

Burmeister, Edwin, and Kent D. Wall. 1982. “Kalman Filtering Estimation of Unobserved
Rational Expectations with an Application to the German Hyperinflation.” Journal of
Econometrics 20:255-84.

Burmeister, Edwin, Kent D. Wall, and James D. Hamilton. 1986. “Estimation of Unobserved
Expected Monthly Inflation Using Kalman Filtering.” Journal of Business and Economic
Statistics 4:147-60.

Caines, Peter E. 1988. Linear Stochastic Systems. New York: Wiley.

Dempster, A. P., N. M. Laird, and D. B. Rubin. 1977. “Maximum Likelihood from
Incomplete Data via the EM Algorithm.” Journal of the Royal Statistical Society Series B,
39:1-38.

Doan, Thomas, Robert B. Litterman, and Christopher A. Sims. 1984. “Forecasting and
Conditional Projection Using Realistic Prior Distributions.” Econometric Reviews 3:1-100.

Fama, Eugene F., and Michael R. Gibbons. 1982. “Inflation, Real Returns, and Capital
Investment.” Journal of Monetary Economics 9:297-323.

Gevers, M., and V. Wertz. 1984. “Uniquely Identifiable State-Space and ARMA
Parameterizations for Multivariable Linear Systems.” Automatica 20:333-47.

Ghosh, Damayanti. 1989. “Maximum Likelihood Estimation of the Dynamic Shock-Error
Model.” Journal of Econometrics 41:121-43.

Hamilton, James D. 1985. “Uncovering Financial Market Expectations of Inflation.” Journal
of Political Economy 93:1224-41.

Hamilton, James D. 1986. “A Standard Error for the Estimated State Vector of a State-Space
Model.” Journal of Econometrics 33:387-97.

Hannan, E. J. 1971. “The Identification Problem for Multiple Equation Systems with Moving
Average Errors.” Econometrica 39:751-65.

Harvey, Andrew, and G. D. A. Phillips. 1979. “Maximum Likelihood Estimation of
Regression Models with Autoregressive-Moving Average Disturbances.” Biometrika 66:49-58.

Kalman, R. E. 1960. “A New Approach to Linear Filtering and Prediction Problems.”
Journal of Basic Engineering, Transactions of the ASME Series D, 82:35-45.

Kalman, R. E. 1963. “New Methods in Wiener Filtering Theory,” in John L. Bogdanoff and
Frank Kozin, eds., Proceedings of the First Symposium of Engineering Applications of Random
Function Theory and Probability, 270-388. New York: Wiley.

Litterman, Robert B. 1986. “Forecasting with Bayesian Vector Autoregressions: Five Years
of Experience.” Journal of Business and Economic Statistics 4:25-38.

Meinhold, Richard J., and Nozer D. Singpurwalla. 1983. “Understanding the Kalman Filter.”
American Statistician 37:123-27.

Nicholls, D. F., and A. R. Pagan. 1985. “Varying Coefficient Regression,” in E. J. Hannan,
P. R. Krishnaiah, and M. M. Rao, eds., Handbook of Statistics, Vol. 5. Amsterdam:
North-Holland.

Pagan, Adrian. 1980. “Some Identification and Estimation Results for Regression Models
with Stochastically Varying Coefficients.” Journal of Econometrics 13:341-63.

Rothenberg, Thomas J. 1971. “Identification in Parametric Models.” Econometrica
39:577-91.

Shumway, R. H., and D. S. Stoffer. 1982. “An Approach to Time Series Smoothing and
Forecasting Using the EM Algorithm.” Journal of Time Series Analysis 3:253-64.

Sims, Christopher A. 1982. “Policy Analysis with Econometric Models.” Brookings Papers
on Economic Activity 1:107-52.

Stock, James H., and Mark W. Watson. 1991. “A Probability Model of the Coincident
Economic Indicators,” in Kajal Lahiri and Geoffrey H. Moore, eds., Leading Economic
Indicators: New Approaches and Forecasting Records. Cambridge, England: Cambridge
University Press.

Tanaka, Katsuto. 1983. “Non-Normality of the Lagrange Multiplier Statistic for Testing the
Constancy of Regression Coefficients.” Econometrica 51:1577-82.

Wall, Kent D. 1987. “Identification Theory for Varying Coefficient Regression Models.”
Journal of Time Series Analysis 8:359-71.

Watson, Mark W. 1989. “Recursive Solution Methods for Dynamic Linear Rational
Expectations Models.” Journal of Econometrics 41:65-89.

Watson, Mark W., and Robert F. Engle. 1983. “Alternative Algorithms for the Estimation of
Dynamic Factor, MIMIC, and Varying Coefficient Regression Models.” Journal of Econometrics
23:385-400.

White, Halbert. 1982. “Maximum Likelihood Estimation of Misspecified Models.”
Econometrica 50:1-25.




14 


Generalized Method 
of Moments 


Suppose we have a set of observations on a variable $y_t$ whose probability law
depends on an unknown vector of parameters $\theta$. One general approach to estimating
$\theta$ is based on the principle of maximum likelihood: we choose as the estimate $\hat{\theta}$
the value for which the data would be most likely to have been observed. A
drawback of this approach is that it requires us to specify the form of the likelihood
function.

This chapter explores an alternative principle for parameter estimation known 
as generalized method of moments (GMM). Although versions of this approach 
have been used for a long time, the general statement of GMM on which this 
chapter is based was only recently developed by Hansen (1982). The key advantage 
of GMM is that it requires specification only of certain moment conditions rather 
than the full density. This can also be a drawback, in that GMM often does not 
make efficient use of all the information in the sample. 

Section 14.1 introduces the ideas behind GMM estimation and derives some 
of the key results. Section 14.2 shows how various other estimators can be viewed 
as special cases of GMM, including ordinary least squares, instrumental variable 
estimation, two-stage least squares, estimators for systems of nonlinear simulta- 
neous equations, and estimators for dynamic rational expectations models. Exten- 
sions and further discussion are provided in Section 14.3. In many cases, even 
maximum likelihood estimation can be viewed as a special case of GMM. Section 
14.4 explores this analogy and uses it to derive some general asymptotic properties 
of maximum likelihood and quasi-maximum likelihood estimation. 


14.1. Estimation by the Generalized Method 
of Moments 


Classical Method of Moments 


It will be helpful to introduce the ideas behind GMM with a concrete example.
Consider a random variable $Y_t$ drawn from a standard $t$ distribution with $\nu$ degrees
of freedom, so that its density is

$f_{Y_t}(y_t; \nu) = \dfrac{\Gamma[(\nu + 1)/2]}{(\pi\nu)^{1/2}\,\Gamma(\nu/2)}\,[1 + (y_t^2/\nu)]^{-(\nu+1)/2},$ [14.1.1]

where $\Gamma(\cdot)$ is the gamma function. Suppose we have an i.i.d. sample of size $T$ ($y_1$,
$y_2, \ldots, y_T$) and want to estimate the degrees of freedom parameter $\nu$. One
approach is to estimate $\nu$ by maximum likelihood. This approach calculates the
sample log likelihood

$\mathcal{L}(\nu) = \sum_{t=1}^{T} \log f_{Y_t}(y_t; \nu)$

and chooses as the estimate $\hat{\nu}$ the value for which $\mathcal{L}(\nu)$ is largest.

An alternative principle on which estimation of $\nu$ might be based reasons as
follows. Provided that $\nu > 2$, a standard $t$ variable has population mean zero and
variance given by

$\mu_2 \equiv E(Y_t^2) = \nu/(\nu - 2).$ [14.1.2]

As the degrees of freedom parameter ($\nu$) goes to infinity, the variance [14.1.2]
approaches unity and the density [14.1.1] approaches that of a standard $N(0, 1)$
variable. Let $\hat{\mu}_{2,T}$ denote the average squared value of $y$ observed in the actual
sample:

$\hat{\mu}_{2,T} \equiv (1/T)\sum_{t=1}^{T} y_t^2.$ [14.1.3]

For large $T$, the sample moment ($\hat{\mu}_{2,T}$) should be close to the population moment
($\mu_2$):

$\hat{\mu}_{2,T} \xrightarrow{p} \mu_2.$

Recalling [14.1.2], this suggests that a consistent estimate of $\nu$ can be obtained by
finding a solution to

$\nu/(\nu - 2) = \hat{\mu}_{2,T}$ [14.1.4]

or

$\hat{\nu}_T = \dfrac{2\hat{\mu}_{2,T}}{\hat{\mu}_{2,T} - 1}.$ [14.1.5]

This estimate exists provided that $\hat{\mu}_{2,T} > 1$, that is, provided that the sample seems
to exhibit more variability than the $N(0, 1)$ distribution. If we instead observed
$\hat{\mu}_{2,T} \le 1$, the estimate of the degrees of freedom would be infinity: a $N(0, 1)$
distribution fits the sample second moment better than any member of the $t$ family.
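
As an illustration of [14.1.3] and [14.1.5], the sketch below simulates i.i.d. draws from a standard $t$ distribution and computes the method of moments estimate. It is a minimal example rather than part of the original text: the function names, the seed, the sample size, and the choice of 10 degrees of freedom are all assumptions made for the demonstration, and the draws are generated as a standard normal divided by the square root of a scaled chi-square.

```python
import math
import random

def mom_degrees_of_freedom(sample):
    """Classical method of moments estimate of nu, following [14.1.5]."""
    mu2_hat = sum(y * y for y in sample) / len(sample)   # sample second moment [14.1.3]
    if mu2_hat <= 1.0:
        return math.inf      # no finite solution: N(0, 1) fits the second moment better
    return 2.0 * mu2_hat / (mu2_hat - 1.0)

def draw_student_t(nu, rng):
    """One draw from a standard t(nu): N(0, 1) / sqrt(chi2(nu) / nu)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(nu / 2.0, 2.0)   # chi-square(nu) = Gamma(shape nu/2, scale 2)
    return z / math.sqrt(chi2 / nu)

rng = random.Random(0)                       # arbitrary seed
sample = [draw_student_t(10.0, rng) for _ in range(50000)]
nu_hat = mom_degrees_of_freedom(sample)
print(nu_hat)                                # expected to be in the neighborhood of 10
```

Because the estimate is a ratio with $\hat{\mu}_{2,T} - 1$ in the denominator, it is quite sensitive to sampling error in the second moment when the true $\nu$ is large.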

The estimator derived from [14.1.4] is known as a classical method of moments
estimator. A general description of this approach is as follows. Given an unknown
$(a \times 1)$ vector of parameters $\theta$ that characterizes the density of an observed variable
$y_t$, suppose that $a$ distinct population moments of the random variable can be
calculated as functions of $\theta$, such as

$E(Y_t^i) = \mu_i(\theta)$ for $i = i_1, i_2, \ldots, i_a$. [14.1.6]

The classical method of moments estimate of $\theta$ is the value $\hat{\theta}_T$ for which these
population moments are equated to the observed sample moments; that is, $\hat{\theta}_T$ is
the value for which

$\mu_i(\hat{\theta}_T) = (1/T)\sum_{t=1}^{T} y_t^i$ for $i = i_1, i_2, \ldots, i_a$.

An early example of this approach was provided by Pearson (1894).


Generalized Method of Moments 


In the example of the $t$ distribution just discussed, a single sample moment
($\hat{\mu}_{2,T}$) was used to estimate a single population parameter ($\nu$). We might also have
made use of other moments. For example, if $\nu > 4$, the population fourth moment
of a standard $t$ variable is

$\mu_4 \equiv E(Y_t^4) = \dfrac{3\nu^2}{(\nu - 2)(\nu - 4)},$

and we might expect this to be close to the sample fourth moment,

$\hat{\mu}_{4,T} = (1/T)\sum_{t=1}^{T} y_t^4.$

We cannot choose the single parameter $\nu$ so as to match both the sample second
moment and the sample fourth moment. However, we might try to choose $\nu$ so as
to be as close as possible to both, by minimizing a criterion function such as

$Q(\nu; y_T, y_{T-1}, \ldots, y_1) \equiv g'Wg,$ [14.1.7]

where

$g \equiv \begin{bmatrix} \hat{\mu}_{2,T} - \dfrac{\nu}{\nu - 2} \\[2mm] \hat{\mu}_{4,T} - \dfrac{3\nu^2}{(\nu - 2)(\nu - 4)} \end{bmatrix}.$ [14.1.8]

Here $W$ is a $(2 \times 2)$ positive definite symmetric weighting matrix reflecting the
importance given to matching each of the moments. The larger is the (1, 1) element
of $W$, the greater is the importance of being as close as possible to satisfying [14.1.4].

An estimate based on minimization of an expression such as [14.1.7] was
called a “minimum chi-square” estimator by Cramér (1946, p. 425), Ferguson
(1958), and Rothenberg (1973) and a “minimum distance estimator” by Malinvaud
(1970). Hansen (1982) provided the most general characterization of this approach
and derived the asymptotic properties for serially dependent processes. Most of
the results reported in this section were developed by Hansen (1982), who described
this as estimation by the “generalized method of moments.”

Hansen’s formulation of the estimation problem is as follows. Let $w_t$ be an
$(h \times 1)$ vector of variables that are observed at date $t$, let $\theta$ denote an unknown
$(a \times 1)$ vector of coefficients, and let $h(\theta, w_t)$ be an $(r \times 1)$ vector-valued function,
$h: (\mathbb{R}^a \times \mathbb{R}^h) \to \mathbb{R}^r$. Since $w_t$ is a random variable, so is $h(\theta, w_t)$. Let $\theta_0$ denote
the true value of $\theta$, and suppose this true value is characterized by the property
that

$E\{h(\theta_0, w_t)\} = 0.$ [14.1.9]
The $r$ rows of the vector equation [14.1.9] are sometimes described as orthogonality
conditions. Let $\mathcal{Y}_T \equiv (w_T', w_{T-1}', \ldots, w_1')'$ be a $(Th \times 1)$ vector containing all
the observations in a sample of size $T$, and let the $(r \times 1)$ vector-valued function
$g(\theta; \mathcal{Y}_T)$ denote the sample average of $h(\theta, w_t)$:

$g(\theta; \mathcal{Y}_T) \equiv (1/T)\sum_{t=1}^{T} h(\theta, w_t).$ [14.1.10]

Notice that $g: \mathbb{R}^a \to \mathbb{R}^r$. The idea behind GMM is to choose $\theta$ so as to make the
sample moment $g(\theta; \mathcal{Y}_T)$ as close as possible to the population moment of zero;
that is, the GMM estimator $\hat{\theta}_T$ is the value of $\theta$ that minimizes the scalar

$Q(\theta; \mathcal{Y}_T) = [g(\theta; \mathcal{Y}_T)]'W_T[g(\theta; \mathcal{Y}_T)],$ [14.1.11]

where $\{W_T\}_{T=1}^{\infty}$ is a sequence of $(r \times r)$ positive definite weighting matrices which
may be a function of the data $\mathcal{Y}_T$. Often, this minimization is achieved numerically
using the methods described in Section 5.7.

The classical method of moments estimator of $\nu$ given in [14.1.5] is a special
case of this formulation with $w_t = y_t$, $\theta = \nu$, $W_T = 1$, and

$h(\theta, w_t) = y_t^2 - \nu/(\nu - 2)$

$g(\theta; \mathcal{Y}_T) = (1/T)\sum_{t=1}^{T} y_t^2 - \nu/(\nu - 2).$

Here, $r = a = 1$ and the objective function [14.1.11] becomes

$Q(\theta; \mathcal{Y}_T) = \left\{(1/T)\sum_{t=1}^{T} y_t^2 - \dfrac{\nu}{\nu - 2}\right\}^2.$

The smallest value that can be achieved for $Q(\cdot)$ is zero, which obtains when $\nu$ is
the magnitude given in [14.1.5].
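The estimate of $\nu$ obtained by minimizing [14.1.7] is also a GMM estimator
with $r = 2$ and

$h(\theta, w_t) = \begin{bmatrix} y_t^2 - \dfrac{\nu}{\nu - 2} \\[2mm] y_t^4 - \dfrac{3\nu^2}{(\nu - 2)(\nu - 4)} \end{bmatrix}.$

Here, $g(\theta; \mathcal{Y}_T)$ and $W_T$ would be as described in [14.1.7] and [14.1.8].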
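
To make the two-moment criterion concrete, the following sketch evaluates $Q = g'Wg$ from [14.1.7] and [14.1.8] and minimizes it over a grid of candidate values for $\nu$. It is illustrative only: the function names are hypothetical, the identity weighting matrix is an arbitrary choice, the grid search is a crude stand-in for a proper numerical optimizer, and the "sample" moments are simply set equal to the population moments of a $t$ variable with 10 degrees of freedom so that the minimizer is known in advance.

```python
def population_moments(nu):
    """Second and fourth moments of a standard t(nu) variable, valid for nu > 4."""
    mu2 = nu / (nu - 2.0)
    mu4 = 3.0 * nu ** 2 / ((nu - 2.0) * (nu - 4.0))
    return mu2, mu4

def criterion(nu, mu2_hat, mu4_hat, W):
    """Q = g' W g from [14.1.7], with g as in [14.1.8] and W a 2x2 nested list."""
    mu2, mu4 = population_moments(nu)
    g = [mu2_hat - mu2, mu4_hat - mu4]
    return sum(g[i] * W[i][j] * g[j] for i in range(2) for j in range(2))

def minimize_on_grid(mu2_hat, mu4_hat, W, lo=4.05, hi=60.0, step=0.01):
    """Return the grid value of nu with the smallest criterion value."""
    n = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(n + 1)]
    return min(grid, key=lambda nu: criterion(nu, mu2_hat, mu4_hat, W))

W = [[1.0, 0.0], [0.0, 1.0]]                  # identity weighting, for illustration only
mu2_hat, mu4_hat = population_moments(10.0)   # pretend the sample moments match t(10)
nu_hat = minimize_on_grid(mu2_hat, mu4_hat, W)
print(nu_hat, criterion(nu_hat, mu2_hat, mu4_hat, W))
```

With real data the two sample moments would generally imply different values of $\nu$, and the minimizer would fall at a compromise whose location depends on $W$.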

A variety of other estimators can also be viewed as examples of GMM, 
including ordinary least squares, instrumental variable estimation, two-stage least 
squares, nonlinear simultaneous equations estimators, estimators for dynamic ra- 
tional expectations models, and in many cases even maximum likelihood. These 
applications will be discussed in Sections 14.2 through 14.4. 

If the number of parameters to be estimated ($a$) is the same as the number
of orthogonality conditions ($r$), then typically the objective function [14.1.11] will
be minimized by setting

$g(\hat{\theta}_T; \mathcal{Y}_T) = 0.$ [14.1.12]

If $a = r$, then the GMM estimator is the value $\hat{\theta}_T$ that satisfies these $r$ equations.
If instead there are more orthogonality conditions than parameters to estimate
($r > a$), then [14.1.12] will not hold exactly. How close the $i$th element of
$g(\hat{\theta}_T; \mathcal{Y}_T)$ is to zero depends on how much weight the $i$th orthogonality condition
is given by the weighting matrix $W_T$.

For any value of $\theta$, the magnitude of the $(r \times 1)$ vector $g(\theta; \mathcal{Y}_T)$ is the sample
mean of $T$ realizations of the $(r \times 1)$ random vector $h(\theta, w_t)$. If $w_t$ is strictly
stationary and $h(\cdot)$ is continuous, then it is reasonable to expect the law of large
numbers to hold:

$g(\theta; \mathcal{Y}_T) \xrightarrow{p} E\{h(\theta, w_t)\}.$

The expression $E\{h(\theta, w_t)\}$ denotes a population magnitude that depends on the
value of $\theta$ and on the probability law of $w_t$. Suppose that this function is continuous
in $\theta$ and that $\theta_0$ is the only value of $\theta$ that satisfies [14.1.9]. Then, under fairly
general stationarity, continuity, and moment conditions, the value of $\hat{\theta}_T$ that
minimizes [14.1.11] offers a consistent estimate of $\theta_0$; see Hansen (1982), Gallant and
White (1988), and Andrews and Fair (1988) for details.


Optimal Weighting Matrix 


Suppose that when evaluated at the true value $\theta_0$, the process $\{h(\theta_0, w_t)\}_{t=-\infty}^{\infty}$
is strictly stationary with mean zero and $v$th autocovariance matrix given by

$\Gamma_v = E\{[h(\theta_0, w_t)][h(\theta_0, w_{t-v})]'\}.$ [14.1.13]

Assuming that these autocovariances are absolutely summable, define

$S \equiv \sum_{v=-\infty}^{\infty} \Gamma_v.$ [14.1.14]

Recall from the discussion in Section 10.5 that $S$ is the asymptotic variance of the
sample mean of $h(\theta_0, w_t)$:

$S = \lim_{T\to\infty} T \cdot E\{[g(\theta_0; \mathcal{Y}_T)][g(\theta_0; \mathcal{Y}_T)]'\}.$

The optimal value for the weighting matrix $W_T$ in [14.1.11] turns out to be
given by $S^{-1}$, the inverse of the asymptotic variance matrix. That is, the minimum
asymptotic variance for the GMM estimator $\hat{\theta}_T$ is obtained when $\hat{\theta}_T$ is chosen to
minimize

$Q(\theta; \mathcal{Y}_T) = [g(\theta; \mathcal{Y}_T)]'S^{-1}[g(\theta; \mathcal{Y}_T)].$ [14.1.15]


To see the intuition behind this claim, consider a simple linear model in which we
have $r$ different observations ($y_1, y_2, \ldots, y_r$) with a different population mean
for each observation ($\mu_1, \mu_2, \ldots, \mu_r$). For example, $y_1$ might denote the sample
mean in a sample of $T_1$ observations on some variable, $y_2$ the sample mean from
a second sample, and so on. In the absence of restrictions, the estimates would
simply be $\hat{\mu}_i = y_i$ for $i = 1, 2, \ldots, r$. In the presence of linear restrictions across
the $\mu$'s, the best estimates that are linear functions of the $y$'s would be obtained
by generalized least squares. Recall that the GLS estimate of $\mu$ is the value that
minimizes

$(y - \mu)'\Omega^{-1}(y - \mu),$ [14.1.16]

where $y = (y_1, y_2, \ldots, y_r)'$, $\mu = (\mu_1, \mu_2, \ldots, \mu_r)'$, and $\Omega$ is the variance-covariance
matrix of $y - \mu$:

$\Omega = E[(y - \mu)(y - \mu)'].$

The optimal weighting matrix to use with the quadratic form in [14.1.16] is given
by $\Omega^{-1}$. Just as $\Omega$ in [14.1.16] is the variance of $(y - \mu)$, so $S$ in [14.1.15] is the
asymptotic variance of $\sqrt{T}\cdot g(\cdot)$.


If the vector process $\{h(\theta_0, w_t)\}_{t=-\infty}^{\infty}$ were serially uncorrelated, then the
matrix $S$ could be consistently estimated by

$S_T^* = (1/T)\sum_{t=1}^{T}[h(\theta_0, w_t)][h(\theta_0, w_t)]'.$ [14.1.17]

Calculating this magnitude requires knowledge of $\theta_0$, though it often also turns out
that

$\hat{S}_T \equiv (1/T)\sum_{t=1}^{T}[h(\hat{\theta}_T, w_t)][h(\hat{\theta}_T, w_t)]' \xrightarrow{p} S$ [14.1.18]

for $\hat{\theta}_T$ any consistent estimate of $\theta_0$, assuming that $h(\theta_0, w_t)$ is serially uncorrelated.

Note that this description of the optimal weighting matrix is somewhat
circular: before we can estimate $\theta$, we need an estimate of the matrix $S$, and before
we can estimate the matrix $S$, we need an estimate of $\theta$. The practical procedure
used in GMM is as follows. An initial estimate $\hat{\theta}_T^{(0)}$ is obtained by minimizing
[14.1.11] with an arbitrary weighting matrix such as $W_T = I_r$. This estimate of $\theta$
is then used in [14.1.18] to produce an initial estimate $\hat{S}_T^{(0)}$. Expression [14.1.11]
is then minimized with $W_T = [\hat{S}_T^{(0)}]^{-1}$ to arrive at a new GMM estimate $\hat{\theta}_T^{(1)}$. This
process can be iterated until $\hat{\theta}_T^{(j)} \cong \hat{\theta}_T^{(j+1)}$, though the estimate based on a single
iteration $\hat{\theta}_T^{(1)}$ has the same asymptotic distribution as that based on an arbitrarily
large number of iterations. Iterating nevertheless offers the practical advantage
that the resulting estimates are invariant with respect to the scale of the data and
to the initial weighting matrix for $W_T$.
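
The two-step procedure just described can be sketched for the $t$ distribution example with the second- and fourth-moment conditions. This is an assumption-laden illustration, not the text's own implementation: the data are simulated, the names are hypothetical, the minimization uses a simple grid over $\nu$ rather than the numerical methods of Section 5.7, and only one update of the weighting matrix is performed.

```python
import math
import random

def t_moments(nu):
    """Population second and fourth moments of a standard t(nu); requires nu > 4."""
    return nu / (nu - 2.0), 3.0 * nu ** 2 / ((nu - 2.0) * (nu - 4.0))

def minimize_q(m2, m4, W, grid):
    """Grid minimizer of g'Wg, where g = (m2 - mu2(nu), m4 - mu4(nu))'."""
    def q(nu):
        mu2, mu4 = t_moments(nu)
        g = (m2 - mu2, m4 - mu4)
        return sum(g[i] * W[i][j] * g[j] for i in range(2) for j in range(2))
    return min(grid, key=q)

def s_hat(sample, nu):
    """[14.1.18]: sample average of h h', with h = (y^2 - mu2(nu), y^4 - mu4(nu))'."""
    mu2, mu4 = t_moments(nu)
    h = [(y ** 2 - mu2, y ** 4 - mu4) for y in sample]
    T = len(h)
    return [[sum(a[i] * a[j] for a in h) / T for j in range(2)] for i in range(2)]

def inverse_2x2(S):
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    return [[S[1][1] / det, -S[0][1] / det], [-S[1][0] / det, S[0][0] / det]]

# Simulated i.i.d. t(10) data; the seed and sample size are arbitrary.
rng = random.Random(1)
sample = [rng.gauss(0.0, 1.0) / math.sqrt(rng.gammavariate(5.0, 2.0) / 10.0)
          for _ in range(20000)]
m2 = sum(y ** 2 for y in sample) / len(sample)
m4 = sum(y ** 4 for y in sample) / len(sample)
grid = [4.05 + 0.01 * i for i in range(5596)]               # nu from 4.05 to 60.00

nu_0 = minimize_q(m2, m4, [[1.0, 0.0], [0.0, 1.0]], grid)   # step 1: W = I
S_0 = s_hat(sample, nu_0)                                    # step 2: estimate S
nu_1 = minimize_q(m2, m4, inverse_2x2(S_0), grid)            # step 3: W = inverse of S
print(nu_0, nu_1)
```

Repeating steps 2 and 3 with `nu_1` in place of `nu_0` would give the iterated version; the asymptotic distribution is the same after the first update.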

On the other hand, if the vector process $\{h(\theta_0, w_t)\}_{t=-\infty}^{\infty}$ is serially correlated,
the Newey-West (1987) estimate of $S$ could be used:

$\hat{S}_T = \hat{\Gamma}_{0,T} + \sum_{v=1}^{q}\{1 - [v/(q + 1)]\}(\hat{\Gamma}_{v,T} + \hat{\Gamma}_{v,T}'),$ [14.1.19]

where

$\hat{\Gamma}_{v,T} = (1/T)\sum_{t=v+1}^{T}[h(\hat{\theta}, w_t)][h(\hat{\theta}, w_{t-v})]',$ [14.1.20]

with $\hat{\theta}$ again an initial consistent estimate of $\theta_0$. Alternatively, the estimators
proposed by Gallant (1987), Andrews (1991), or Andrews and Monahan (1992)
that were discussed in Section 10.5 could also be applied in this context.
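
A direct transcription of [14.1.19] and [14.1.20] might look as follows. This is a sketch with hypothetical function names; it takes as input a list of $T$ already-evaluated vectors $h(\hat{\theta}, w_t)$, each stored as a Python list of length $r$, and a user-chosen lag truncation $q$.

```python
def autocov(h, v):
    """Gamma_hat_v = (1/T) * sum_{t=v+1}^{T} h_t h_{t-v}', as in [14.1.20]."""
    T, r = len(h), len(h[0])
    return [[sum(h[t][i] * h[t - v][j] for t in range(v, T)) / T
             for j in range(r)] for i in range(r)]

def newey_west(h, q):
    """S_hat = Gamma_0 + sum_{v=1}^{q} {1 - v/(q+1)} (Gamma_v + Gamma_v'), as in [14.1.19]."""
    r = len(h[0])
    S = autocov(h, 0)
    for v in range(1, q + 1):
        weight = 1.0 - v / (q + 1.0)
        G = autocov(h, v)
        for i in range(r):
            for j in range(r):
                S[i][j] += weight * (G[i][j] + G[j][i])
    return S

# Tiny scalar example (r = 1) with strongly negative first-order autocorrelation.
h = [[1.0], [-1.0], [1.0], [-1.0]]
print(newey_west(h, 0))   # [[1.0]]: just Gamma_0
print(newey_west(h, 1))   # [[0.25]]: 1 + 2 * 0.5 * (-0.75)
```

The declining weights are what keep the resulting estimate positive semidefinite; an unweighted sum of sample autocovariances need not be.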


Asymptotic Distribution of the GMM Estimates 


Let $\hat{\theta}_T$ be the value that minimizes

$[g(\theta; \mathcal{Y}_T)]'\hat{S}_T^{-1}[g(\theta; \mathcal{Y}_T)],$ [14.1.21]

with $\hat{S}_T$ regarded as fixed with respect to $\theta$ and $\hat{S}_T \xrightarrow{p} S$. Assuming an interior
optimum, this minimization is achieved by setting the derivative of [14.1.21] with
respect to $\theta$ to zero. Thus, the GMM estimate $\hat{\theta}_T$ is typically a solution to the
following system of nonlinear equations:

$\left\{\left.\dfrac{\partial g(\theta; \mathcal{Y}_T)}{\partial\theta'}\right|_{\theta=\hat{\theta}_T}\right\}' \times \hat{S}_T^{-1} \times [g(\hat{\theta}_T; \mathcal{Y}_T)] = 0,$ [14.1.22]

where the three factors have dimensions $(a \times r)$, $(r \times r)$, and $(r \times 1)$ and the zero vector is $(a \times 1)$.

Here $[\partial g(\theta; \mathcal{Y}_T)/\partial\theta']|_{\theta=\hat{\theta}_T}$ denotes the $(r \times a)$ matrix of derivatives of the function
$g(\theta; \mathcal{Y}_T)$, where these derivatives are evaluated at the GMM estimate $\hat{\theta}_T$.

Since $g(\theta_0; \mathcal{Y}_T)$ is the sample mean of a process whose population mean is
zero, $g(\cdot)$ should satisfy the central limit theorem given conditions such as strict
stationarity of $w_t$, continuity of $h(\theta, w_t)$, and restrictions on higher moments. Thus,
in many instances it should be the case that

$\sqrt{T}\cdot g(\theta_0; \mathcal{Y}_T) \xrightarrow{L} N(0, S).$

Not much more than this is needed to conclude that the GMM estimator $\hat{\theta}_T$ is
asymptotically Gaussian and to calculate its asymptotic variance. The following
proposition, adapted from Hansen (1982), is proved in Appendix 14.A at the end
of this chapter.


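Proposition 14.1: Let $g(\theta; \mathcal{Y}_T)$ be differentiable in $\theta$ for all $\mathcal{Y}_T$, and let $\hat{\theta}_T$ be the
GMM estimator satisfying [14.1.22] with $r \ge a$. Let $\{\hat{S}_T\}_{T=1}^{\infty}$ be a sequence of positive
definite $(r \times r)$ matrices such that $\hat{S}_T \xrightarrow{p} S$, with $S$ positive definite. Suppose, further,
that the following hold:

(a) $\hat{\theta}_T \xrightarrow{p} \theta_0$;
(b) $\sqrt{T}\cdot g(\theta_0; \mathcal{Y}_T) \xrightarrow{L} N(0, S)$; and
(c) for any sequence $\{\theta_T^*\}_{T=1}^{\infty}$ satisfying $\theta_T^* \xrightarrow{p} \theta_0$, it is the case that

$\mathrm{plim}\left\{\left.\dfrac{\partial g(\theta; \mathcal{Y}_T)}{\partial\theta'}\right|_{\theta=\theta_T^*}\right\} = \mathrm{plim}\left\{\left.\dfrac{\partial g(\theta; \mathcal{Y}_T)}{\partial\theta'}\right|_{\theta=\theta_0}\right\} \equiv D',$ [14.1.23]

where $D'$ is $(r \times a)$, with the columns of $D'$ linearly independent.

Then

$\sqrt{T}(\hat{\theta}_T - \theta_0) \xrightarrow{L} N(0, V),$ [14.1.24]

where

$V = \{DS^{-1}D'\}^{-1}.$

Proposition 14.1 implies that we can treat $\hat{\theta}_T$ approximately as

$\hat{\theta}_T \approx N(\theta_0, \hat{V}_T/T),$ [14.1.25]

where

$\hat{V}_T = \{\hat{D}_T\hat{S}_T^{-1}\hat{D}_T'\}^{-1}.$

The estimate $\hat{S}_T$ can be constructed as in [14.1.18] or [14.1.19], while the $(r \times a)$ matrix $\hat{D}_T'$ is

$\hat{D}_T' = \left.\dfrac{\partial g(\theta; \mathcal{Y}_T)}{\partial\theta'}\right|_{\theta=\hat{\theta}_T}.$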
Testing the Overidentifying Restrictions 


When the number of orthogonality conditions exceeds the number of parameters
to be estimated ($r > a$), the model is overidentified, in that more orthogonality
conditions were used than are needed to estimate $\theta$. In this case, Hansen (1982)
suggested a test of whether all of the sample moments represented by $g(\hat{\theta}_T; \mathcal{Y}_T)$
are as close to zero as would be expected if the corresponding population moments
$E\{h(\theta_0, w_t)\}$ were truly zero.

From Proposition 8.1 and condition (b) in Proposition 14.1, notice that if the
population orthogonality conditions in [14.1.9] were all true, then

$[\sqrt{T}\cdot g(\theta_0; \mathcal{Y}_T)]'S^{-1}[\sqrt{T}\cdot g(\theta_0; \mathcal{Y}_T)] \xrightarrow{L} \chi^2(r).$ [14.1.26]

In [14.1.26], the sample moment function $g(\theta; \mathcal{Y}_T)$ is evaluated at the true value
of $\theta_0$. One’s first guess might be that condition [14.1.26] also holds when [14.1.26]
is evaluated at the GMM estimate $\hat{\theta}_T$. However, this is not the case. The reason
is that [14.1.22] implies that $a$ different linear combinations of the $(r \times 1)$ vector
$g(\hat{\theta}_T; \mathcal{Y}_T)$ are identically zero, these being the $a$ linear combinations obtained when
$g(\hat{\theta}_T; \mathcal{Y}_T)$ is premultiplied by the $(a \times r)$ matrix

$\left\{\left.\dfrac{\partial g(\theta; \mathcal{Y}_T)}{\partial\theta'}\right|_{\theta=\hat{\theta}_T}\right\}' \hat{S}_T^{-1}.$

For example, when $a = r$, all linear combinations of $g(\hat{\theta}_T; \mathcal{Y}_T)$ are identically zero,
and if $\theta_0$ were replaced by $\hat{\theta}_T$, the magnitude in [14.1.26] would simply equal zero
in all samples.

Since the vector $g(\hat{\theta}_T; \mathcal{Y}_T)$ contains $(r - a)$ nondegenerate random variables,
it turns out that a correct test of the overidentifying restrictions for the case when
$r > a$ can be based on the fact that

$[\sqrt{T}\cdot g(\hat{\theta}_T; \mathcal{Y}_T)]'\hat{S}_T^{-1}[\sqrt{T}\cdot g(\hat{\theta}_T; \mathcal{Y}_T)] \xrightarrow{L} \chi^2(r - a).$ [14.1.27]

Moreover, this test statistic is trivial to calculate, for it is simply the sample size $T$
times the value attained for the objective function [14.1.21] at the GMM estimate
$\hat{\theta}_T$.

Unfortunately, Hansen’s $\chi^2$ test based on [14.1.27] can easily fail to detect a
misspecified model (Newey, 1985). It is therefore often advisable to supplement
this test with others described in Section 14.3.
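
The statistic in [14.1.27] is simple enough to compute by hand. The sketch below does so for a case with $r = 2$ and $a = 1$, so that the reference distribution is $\chi^2(1)$. The numerical inputs are hypothetical placeholders chosen only to show the arithmetic, not the output of an actual GMM minimization, and the function name is an assumption.

```python
def j_statistic(g, S_inv, T):
    """T * g' S^{-1} g: the test statistic in [14.1.27]."""
    r = len(g)
    return T * sum(g[i] * S_inv[i][j] * g[j] for i in range(r) for j in range(r))

CHI2_1_CRITICAL_5PCT = 3.841        # 5% critical value of a chi-square(1) variable

# Hypothetical inputs: sample moments at the estimate, inverse of S_hat, sample size.
g_hat = [0.02, -0.01]
S_hat_inv = [[2.0, 0.5], [0.5, 1.0]]
T = 500

J = j_statistic(g_hat, S_hat_inv, T)
reject = J > CHI2_1_CRITICAL_5PCT
print(J, reject)
```

A value of the statistic above the critical value would be evidence against the joint validity of the orthogonality conditions, subject to the caveat about low power noted in the text.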


14.2. Examples 


This section shows how properties of a variety of different estimators can be
obtained as special cases of Hansen’s results for generalized method of moments
estimation. To facilitate this discussion, we first summarize the results of the
preceding section.


Summary of GMM


The statistical model is assumed to imply a set of $r$ orthogonality conditions
of the form

$E\{h(\theta_0, w_t)\} = 0,$ [14.2.1]

where both sides are $(r \times 1)$, $w_t$ is a strictly stationary vector of variables observed at date $t$, $\theta_0$ is the true
value of an unknown $(a \times 1)$ vector of parameters, and $h(\cdot)$ is a differentiable
$r$-dimensional vector-valued function with $r \ge a$. The GMM estimate $\hat{\theta}_T$ is the
value of $\theta$ that minimizes

$[g(\theta; \mathcal{Y}_T)]'\hat{S}_T^{-1}[g(\theta; \mathcal{Y}_T)],$ [14.2.2]

where

$g(\theta; \mathcal{Y}_T) \equiv (1/T)\sum_{t=1}^{T} h(\theta, w_t)$ [14.2.3]

is $(r \times 1)$ and $\hat{S}_T$ is an estimate of the $(r \times r)$ matrix$^1$

$S = \lim_{T\to\infty}(1/T)\sum_{t=1}^{T}\sum_{v=-\infty}^{\infty} E\{[h(\theta_0, w_t)][h(\theta_0, w_{t-v})]'\}.$ [14.2.4]

The GMM estimate can be treated as if

$\hat{\theta}_T \approx N(\theta_0, \hat{V}_T/T),$ [14.2.5]

where the $(a \times a)$ matrix $\hat{V}_T$ is

$\hat{V}_T = \{\hat{D}_T \cdot \hat{S}_T^{-1} \cdot \hat{D}_T'\}^{-1}$ [14.2.6]

and the $(r \times a)$ matrix $\hat{D}_T'$ is

$\hat{D}_T' = \left.\dfrac{\partial g(\theta; \mathcal{Y}_T)}{\partial\theta'}\right|_{\theta=\hat{\theta}_T}.$ [14.2.7]

We now explore how these results would be applied in various special cases.


Ordinary Least Squares


Ordinary Least Squares 


Consider the standard linear regression model,

$y_t = x_t'\beta + u_t,$ [14.2.8]

for $x_t$ a $(k \times 1)$ vector of explanatory variables. The critical assumption needed
to justify OLS regression is that the regression residual $u_t$ is uncorrelated with the
explanatory variables:

$E(x_t u_t) = 0.$ [14.2.9]


$^1$Under strict stationarity, the magnitude

$E\{[h(\theta_0, w_t)][h(\theta_0, w_{t-v})]'\} = \Gamma_v$

does not depend on $t$. The expression in the text is more general than necessary under the stated
assumptions. This expression is appropriate for a characterization of GMM that does not assume strict
stationarity. The expression in the text is also helpful in suggesting estimates of $S$ that can be used in
various special cases described later in this section.


In other words, the true value $\beta_0$ is assumed to satisfy the condition

$$E[x_t(y_t - x_t'\beta_0)] = 0.$$ [14.2.10]

Expression [14.2.10] describes k orthogonality conditions of the form of [14.2.1],
in which $w_t = (y_t, x_t')'$, θ = β, and

$$h(\theta, w_t) = x_t(y_t - x_t'\beta).$$ [14.2.11]

The number of orthogonality conditions is the same as the number of unknown
parameters in β, so that r = a = k. Hence, the standard regression model could
be viewed as a just-identified GMM specification. Since it is just identified, the
GMM estimate of β is the value that sets the sample average value for [14.2.11]
equal to zero:

$$0 = g(\hat\theta_T; \mathcal{Y}_T) = (1/T) \sum_{t=1}^{T} x_t(y_t - x_t'\hat\beta_T).$$ [14.2.12]

Rearranging [14.2.12] results in

$$\sum_{t=1}^{T} x_t y_t = \left\{\sum_{t=1}^{T} x_t x_t'\right\} \hat\beta_T$$

or

$$\hat\beta_T = \left\{\sum_{t=1}^{T} x_t x_t'\right\}^{-1} \left\{\sum_{t=1}^{T} x_t y_t\right\},$$ [14.2.13]

which is the usual OLS estimator. Hence, OLS is a special case of GMM.
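
The equivalence between solving the sample moment conditions [14.2.12] and computing OLS can be checked numerically. The following is a minimal sketch, assuming a regressor vector $x_t = (1, x_t)'$ (so k = 2) and simulated data; the function names and the simulated parameter values are illustrative and do not come from the text.

```python
import random

def ols_from_moment_conditions(x, y):
    """Solve (1/T) sum_t x_t (y_t - x_t' b) = 0 for the regressor vector (1, x_t)'."""
    T = len(y)
    # Elements of sum_t x_t x_t' (a 2 x 2 matrix) and sum_t x_t y_t (a 2 x 1 vector).
    a, b = float(T), sum(x)
    d = sum(v * v for v in x)
    s0 = sum(y)
    s1 = sum(xi * yi for xi, yi in zip(x, y))
    det = a * d - b * b
    # Explicit 2 x 2 inverse applied to (s0, s1)', as in [14.2.13].
    return (d * s0 - b * s1) / det, (a * s1 - b * s0) / det

def sample_moments(x, y, b0, b1):
    """Sample average of h(theta, w_t) = x_t (y_t - x_t' b), as in [14.2.12]."""
    T = len(y)
    u = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
    return sum(u) / T, sum(xi * ui for xi, ui in zip(x, u)) / T

# Simulated data (illustrative values only).
random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(200)]
y = [1.0 + 2.0 * xi + random.gauss(0.0, 0.5) for xi in x]

b0, b1 = ols_from_moment_conditions(x, y)
g = sample_moments(x, y, b0, b1)  # both elements are zero up to rounding error
```

Because the model is just identified, the weighting matrix plays no role here: the estimate sets every sample moment to zero.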

Note that in deriving the GMM estimator in [14.2.13] we assumed that the
residual $u_t$ was uncorrelated with the explanatory variables, but we did not make
any other assumptions about heteroskedasticity or serial correlation of the residuals.
In the presence of heteroskedasticity or serial correlation, OLS is not as efficient
as GLS. Because GMM uses the OLS estimate even in the presence of heteroskedasticity
or serial correlation, GMM in general is not efficient. However, recall
from Section 8.2 that one can still use OLS in the presence of heteroskedasticity
or serial correlation. As long as condition [14.2.9] is satisfied, OLS yields a consistent
estimate of β, though the formulas for standard errors have to be adjusted
to take account of the heteroskedasticity or autocorrelation.

The GMM expression for the variance of $\hat\beta_T$ is given by [14.2.6]. Differentiating
[14.2.11], we see that

$$\hat{D}_T' = \left.\frac{\partial g(\theta; \mathcal{Y}_T)}{\partial \theta'}\right|_{\theta = \hat\theta_T} = (1/T) \sum_{t=1}^{T} \left.\frac{\partial x_t(y_t - x_t'\beta)}{\partial \beta'}\right|_{\beta = \hat\beta_T} = -(1/T) \sum_{t=1}^{T} x_t x_t'.$$ [14.2.14]

Substituting [14.2.11] into [14.2.4] results in

$$S = \lim_{T \to \infty} (1/T) \sum_{t=1}^{T} \sum_{v=-\infty}^{\infty} E\{u_t u_{t-v} x_t x_{t-v}'\}.$$ [14.2.15]

Suppose that $u_t$ is regarded as conditionally homoskedastic and serially uncorrelated:

$$E\{u_t u_{t-v} x_t x_{t-v}'\} = \begin{cases} \sigma^2 E(x_t x_t') & \text{for } v = 0 \\ 0 & \text{for } v \neq 0. \end{cases}$$


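In this case the matrix in [14.2.15] should be consistently estimated by

$$\hat{S}_T = \hat\sigma_T^2 (1/T) \sum_{t=1}^{T} x_t x_t',$$ [14.2.16]

where

$$\hat\sigma_T^2 = (1/T) \sum_{t=1}^{T} \hat{u}_t^2$$

for $\hat{u}_t = y_t - x_t'\hat\beta_T$, the OLS residual. Substituting [14.2.14] and [14.2.16] into
[14.2.6] produces a variance-covariance matrix for the OLS estimate $\hat\beta_T$ of

$$(1/T)\hat{V}_T = (1/T) \left\{ (1/T) \sum_{t=1}^{T} x_t x_t' \left[ \hat\sigma_T^2 (1/T) \sum_{t=1}^{T} x_t x_t' \right]^{-1} (1/T) \sum_{t=1}^{T} x_t x_t' \right\}^{-1} = \hat\sigma_T^2 \left[ \sum_{t=1}^{T} x_t x_t' \right]^{-1}.$$

Apart from the estimate of σ², this is the usual expression for the variance of the
OLS estimator under these conditions.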
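
The simplification in the homoskedastic case may be easier to follow with the intermediate steps written out. The following is a sketch of the algebra, where the shorthand $A_T$ is introduced here for convenience and is not notation used in the text. The two minus signs contributed by [14.2.14] cancel in the product.

```latex
\begin{aligned}
A_T &\equiv (1/T) \sum_{t=1}^{T} x_t x_t', \qquad
\hat{D}_T' = \hat{D}_T = -A_T, \qquad
\hat{S}_T = \hat\sigma_T^2 A_T, \\
(1/T)\hat{V}_T
  &= (1/T) \left\{ (-A_T) \left[ \hat\sigma_T^2 A_T \right]^{-1} (-A_T) \right\}^{-1} \\
  &= (1/T) \left\{ \hat\sigma_T^{-2} A_T \right\}^{-1} \\
  &= (\hat\sigma_T^2 / T) A_T^{-1}
   = \hat\sigma_T^2 \left[ \sum_{t=1}^{T} x_t x_t' \right]^{-1}.
\end{aligned}
```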

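On the other hand, suppose that $u_t$ is conditionally heteroskedastic and serially
correlated. In this case, the estimate of S proposed in [14.1.19] would be

$$\hat{S}_T = \hat\Gamma_{0,T} + \sum_{v=1}^{q} \{1 - [v/(q+1)]\} (\hat\Gamma_{v,T} + \hat\Gamma_{v,T}'),$$

where

$$\hat\Gamma_{v,T} = (1/T) \sum_{t=v+1}^{T} \hat{u}_t \hat{u}_{t-v} x_t x_{t-v}'.$$

Under these assumptions, the GMM approximation for the variance-covariance
matrix of $\hat\beta_T$ would be

$$E[(\hat\beta_T - \beta)(\hat\beta_T - \beta)'] \cong (1/T) \left\{ (1/T) \sum_{t=1}^{T} x_t x_t' \, \hat{S}_T^{-1} \, (1/T) \sum_{t=1}^{T} x_t x_t' \right\}^{-1} = T \left[ \sum_{t=1}^{T} x_t x_t' \right]^{-1} \hat{S}_T \left[ \sum_{t=1}^{T} x_t x_t' \right]^{-1},$$

which is the expression derived earlier in equation [10.5.21]. White's (1980)
heteroskedasticity-consistent standard errors in [8.2.35] are obtained as a special case
when q = 0.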
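
The heteroskedasticity- and autocorrelation-consistent variance can be sketched in a few lines for the simplest case. The code below assumes a single regressor with no intercept (k = 1), so that every matrix in the formula collapses to a scalar and $\hat\Gamma_{v,T} + \hat\Gamma_{v,T}' = 2\hat\Gamma_{v,T}$; the function name and the small data set are illustrative only.

```python
def newey_west_variance(x, y, q):
    """OLS slope and its HAC variance for a scalar regressor without intercept.

    Returns (beta_hat, s_hat, var_beta), where s_hat is the Bartlett-weighted
    estimate of S and var_beta = T * s_hat / (sum x_t^2)^2.
    """
    T = len(y)
    sxx = sum(v * v for v in x)
    beta = sum(xi * yi for xi, yi in zip(x, y)) / sxx
    u = [yi - beta * xi for xi, yi in zip(x, y)]
    m = [xi * ui for xi, ui in zip(x, u)]  # x_t * u_hat_t

    def gamma(v):
        # (1/T) * sum_{t=v+1}^{T} u_t u_{t-v} x_t x_{t-v}, using 0-based indexing
        return sum(m[t] * m[t - v] for t in range(v, T)) / T

    s_hat = gamma(0)
    for v in range(1, q + 1):
        s_hat += (1.0 - v / (q + 1.0)) * 2.0 * gamma(v)
    var_beta = T * s_hat / (sxx * sxx)
    return beta, s_hat, var_beta

# Small illustrative data set.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1]
beta_hat, s_white, var_white = newey_west_variance(x, y, q=0)  # White (1980) case
_, s_hac, var_hac = newey_west_variance(x, y, q=2)
```

Setting q = 0 drops all autocovariance terms, which reproduces the heteroskedasticity-consistent variance as the special case mentioned in the text.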


Instrumental Variable Estimation 


Consider again a linear model

$$y_t = z_t'\beta + u_t,$$ [14.2.17]

where $z_t$ is a (k × 1) vector of explanatory variables. Suppose now that some of
the explanatory variables are endogenous, so that $E(z_t u_t) \neq 0$. Let $x_t$ be an
(r × 1) vector of predetermined explanatory variables that are correlated with $z_t$
but uncorrelated with $u_t$:

$$E(x_t u_t) = 0.$$

The r orthogonality conditions are now

$$E[x_t(y_t - z_t'\beta_0)] = 0.$$ [14.2.18]

This again will be recognized as a special case of the GMM framework in which
$w_t = (y_t, z_t', x_t')'$, θ = β, a = k, and

$$h(\theta, w_t) = x_t(y_t - z_t'\beta).$$ [14.2.19]


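Suppose that the number of parameters to be estimated equals the number
of orthogonality conditions (a = k = r). Then the model is just identified, and
the GMM estimator satisfies

$$0 = g(\hat\theta_T; \mathcal{Y}_T) = (1/T) \sum_{t=1}^{T} x_t(y_t - z_t'\hat\beta_T)$$ [14.2.20]

or

$$\hat\beta_T = \left\{\sum_{t=1}^{T} x_t z_t'\right\}^{-1} \left\{\sum_{t=1}^{T} x_t y_t\right\},$$

which is the usual instrumental variable estimator for this model. To calculate the
standard errors implied by Hansen's (1982) general results, we differentiate [14.2.19]
to find

$$\hat{D}_T' = \left.\frac{\partial g(\theta; \mathcal{Y}_T)}{\partial \theta'}\right|_{\theta = \hat\theta_T} = (1/T) \sum_{t=1}^{T} \left.\frac{\partial x_t(y_t - z_t'\beta)}{\partial \beta'}\right|_{\beta = \hat\beta_T} = -(1/T) \sum_{t=1}^{T} x_t z_t'.$$ [14.2.21]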
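
A just-identified instrumental variable estimate can be illustrated with one endogenous regressor and one instrument (k = r = 1), in which case the estimator solving [14.2.20] is a ratio of two sums. The sketch below uses simulated data with a common shock that makes the regressor endogenous; the function names, the data-generating values, and the homoskedastic variance formula specialized to scalars are assumptions made for illustration.

```python
import random

def iv_estimate(x, z, y):
    """Just-identified IV: solves sum_t x_t (y_t - z_t b) = 0 for scalar z_t, x_t."""
    sxz = sum(xi * zi for xi, zi in zip(x, z))
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    return sxy / sxz

def iv_variance_homoskedastic(x, z, y, beta):
    """Scalar version of sigma^2 [sum x z]^{-1} [sum x x] [sum z x]^{-1}."""
    T = len(y)
    sigma2 = sum((yi - zi * beta) ** 2 for zi, yi in zip(z, y)) / T
    sxz = sum(xi * zi for xi, zi in zip(x, z))
    sxx = sum(xi * xi for xi in x)
    return sigma2 * sxx / (sxz * sxz)

# Simulated data: the shock e enters both z_t and u_t, so E(z_t u_t) != 0,
# while the instrument x_t is independent of u_t.
random.seed(1)
T = 500
x = [random.gauss(0.0, 1.0) for _ in range(T)]
e = [random.gauss(0.0, 1.0) for _ in range(T)]
z = [xi + ei for xi, ei in zip(x, e)]
y = [1.5 * zi + 0.8 * ei + random.gauss(0.0, 0.5) for zi, ei in zip(z, e)]

beta_iv = iv_estimate(x, z, y)
moment = sum(xi * (yi - zi * beta_iv) for xi, zi, yi in zip(x, z, y)) / T
var_iv = iv_variance_homoskedastic(x, z, y, beta_iv)
```

The sample orthogonality condition evaluated at the estimate is zero up to rounding, which is the defining property of the just-identified case.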


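The requirement in Proposition 14.1 that the plim of this matrix have linearly
independent columns is the same condition that was needed to establish consistency
of the IV estimator in Chapter 9, namely, the condition that the rows of $E(z_t x_t')$
be linearly independent. The GMM variance for $\hat\beta_T$ is seen from [14.2.6] to be

$$(1/T)\hat{V}_T = (1/T) \left\{ \left[ (1/T) \sum_{t=1}^{T} z_t x_t' \right] \hat{S}_T^{-1} \left[ (1/T) \sum_{t=1}^{T} x_t z_t' \right] \right\}^{-1},$$ [14.2.22]

where $\hat{S}_T$ is an estimate of

$$S = \lim_{T \to \infty} (1/T) \sum_{t=1}^{T} \sum_{v=-\infty}^{\infty} E\{u_t u_{t-v} x_t x_{t-v}'\}.$$ [14.2.23]

If the regression residuals $\{u_t\}$ are serially uncorrelated and homoskedastic with
variance σ², the natural estimate of S is

$$\hat{S}_T = \hat\sigma_T^2 (1/T) \sum_{t=1}^{T} x_t x_t'$$ [14.2.24]

for $\hat\sigma_T^2 = (1/T) \sum_{t=1}^{T} (y_t - z_t'\hat\beta_T)^2$. Substituting this estimate into [14.2.22] yields

$$E[(\hat\beta_T - \beta)(\hat\beta_T - \beta)'] \cong \hat\sigma_T^2 \left\{ \left[ \sum_{t=1}^{T} z_t x_t' \right] \left[ \sum_{t=1}^{T} x_t x_t' \right]^{-1} \left[ \sum_{t=1}^{T} x_t z_t' \right] \right\}^{-1} = \hat\sigma_T^2 \left[ \sum_{t=1}^{T} x_t z_t' \right]^{-1} \left[ \sum_{t=1}^{T} x_t x_t' \right] \left[ \sum_{t=1}^{T} z_t x_t' \right]^{-1},$$

the same result derived earlier in [9.2.30]. On the other hand, a heteroskedasticity-
and autocorrelation-consistent variance-covariance matrix for IV estimation is given
by

$$E[(\hat\beta_T - \beta)(\hat\beta_T - \beta)'] \cong T \left[ \sum_{t=1}^{T} x_t z_t' \right]^{-1} \hat{S}_T \left[ \sum_{t=1}^{T} z_t x_t' \right]^{-1},$$ [14.2.25]

where

$$\hat{S}_T = \hat\Gamma_{0,T} + \sum_{v=1}^{q} \{1 - [v/(q+1)]\} (\hat\Gamma_{v,T} + \hat\Gamma_{v,T}'),$$ [14.2.26]

$$\hat\Gamma_{v,T} = (1/T) \sum_{t=v+1}^{T} \hat{u}_t \hat{u}_{t-v} x_t x_{t-v}',$$

$$\hat{u}_t = y_t - z_t'\hat\beta_T.$$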


Two-Stage Least Squares 


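Consider again the linear model of [14.2.17] and [14.2.18], but suppose now
that the number of valid instruments r exceeds the number of explanatory variables
k. For this overidentified model, GMM will no longer set all the sample orthogonality
conditions to zero as in [14.2.20], but instead will be the solution to [14.1.22],

$$0 = \left\{ \left.\frac{\partial g(\theta; \mathcal{Y}_T)}{\partial \theta'}\right|_{\theta = \hat\theta_T} \right\}' \times \hat{S}_T^{-1} \times [g(\hat\theta_T; \mathcal{Y}_T)] = \left\{ -(1/T) \sum_{t=1}^{T} z_t x_t' \right\} \hat{S}_T^{-1} \left\{ (1/T) \sum_{t=1}^{T} x_t (y_t - z_t'\hat\beta_T) \right\},$$ [14.2.27]

with the last line following from [14.2.21] and [14.2.20]. Again, if $u_t$ is serially
uncorrelated and homoskedastic with variance σ², a natural estimate of S is given
by [14.2.24]. Using this estimate, [14.2.27] becomes

$$(1/\hat\sigma_T^2) \times \left\{ \sum_{t=1}^{T} z_t x_t' \right\} \left\{ \sum_{t=1}^{T} x_t x_t' \right\}^{-1} \left\{ \sum_{t=1}^{T} x_t (y_t - z_t'\hat\beta_T) \right\} = 0.$$ [14.2.28]

As in expression [9.2.5], define

$$\hat\delta' = \left\{ \sum_{t=1}^{T} z_t x_t' \right\} \left\{ \sum_{t=1}^{T} x_t x_t' \right\}^{-1}.$$

Thus, $\hat\delta'$ is a (k × r) matrix whose ith row represents the coefficients from an
OLS regression of $z_{it}$ on $x_t$. Let

$$\hat{z}_t = \hat\delta' x_t$$

be the (k × 1) vector of fitted values from these regressions of $z_t$ on $x_t$. Then
[14.2.28] implies that

$$\sum_{t=1}^{T} \hat{z}_t (y_t - z_t'\hat\beta_T) = 0$$

or

$$\hat\beta_T = \left\{ \sum_{t=1}^{T} \hat{z}_t z_t' \right\}^{-1} \left\{ \sum_{t=1}^{T} \hat{z}_t y_t \right\}.$$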


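Thus, the GMM estimator for this case is simply the two-stage least squares estimator
as written in [9.2.8]. The variance given in [14.2.6] would be

$$(1/T)\hat{V}_T = (1/T) \left\{ \left[ (1/T) \sum_{t=1}^{T} z_t x_t' \right] \hat{S}_T^{-1} \left[ (1/T) \sum_{t=1}^{T} x_t z_t' \right] \right\}^{-1} = \hat\sigma_T^2 \left\{ \left[ \sum_{t=1}^{T} z_t x_t' \right] \left[ \sum_{t=1}^{T} x_t x_t' \right]^{-1} \left[ \sum_{t=1}^{T} x_t z_t' \right] \right\}^{-1},$$

as earlier derived in expression [9.2.25]. A test of the overidentifying assumptions
embodied in the model in [14.2.17] and [14.2.18] is given by

$$T [g(\hat\theta_T; \mathcal{Y}_T)]' \hat{S}_T^{-1} [g(\hat\theta_T; \mathcal{Y}_T)] = T \left\{ (1/T) \sum_{t=1}^{T} x_t (y_t - z_t'\hat\beta_T) \right\}' \left\{ \hat\sigma_T^2 (1/T) \sum_{t=1}^{T} x_t x_t' \right\}^{-1} \left\{ (1/T) \sum_{t=1}^{T} x_t (y_t - z_t'\hat\beta_T) \right\}$$

$$= \hat\sigma_T^{-2} \left\{ \sum_{t=1}^{T} \hat{u}_t x_t' \right\} \left\{ \sum_{t=1}^{T} x_t x_t' \right\}^{-1} \left\{ \sum_{t=1}^{T} x_t \hat{u}_t \right\}.$$

This magnitude will have an asymptotic χ² distribution with (r − k) degrees of
freedom if the model is correctly specified.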
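
The two-stage least squares calculation and the accompanying test of overidentifying restrictions can be sketched for the smallest overidentified case, with one endogenous regressor and two instruments (k = 1, r = 2). The code assumes homoskedastic, serially uncorrelated residuals, so that the weighting matrix is based on $\sum x_t x_t'$; the helper `inv2`, the function name, and the simulated data are illustrative and not taken from the text.

```python
import random

def inv2(m):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

def tsls_and_j(x1, x2, z, y):
    """2SLS with instruments (x1_t, x2_t) and one regressor z_t, plus the J statistic."""
    T = len(y)
    xs = list(zip(x1, x2))
    sxx = [[sum(a[i] * a[j] for a in xs) for j in range(2)] for i in range(2)]
    sxz = [sum(a[i] * zt for a, zt in zip(xs, z)) for i in range(2)]
    sxx_inv = inv2(sxx)
    # First stage: delta = (sum x x')^{-1} (sum x z); fitted values z_hat_t = delta' x_t.
    delta = [sum(sxx_inv[i][j] * sxz[j] for j in range(2)) for i in range(2)]
    z_hat = [delta[0] * a[0] + delta[1] * a[1] for a in xs]
    # Second stage: beta = (sum z_hat z)^{-1} (sum z_hat y).
    beta = (sum(zh * yt for zh, yt in zip(z_hat, y))
            / sum(zh * zt for zh, zt in zip(z_hat, z)))
    u = [yt - zt * beta for zt, yt in zip(z, y)]
    sigma2 = sum(v * v for v in u) / T
    sxu = [sum(a[i] * ut for a, ut in zip(xs, u)) for i in range(2)]
    # J = sigma^{-2} (sum u x') (sum x x')^{-1} (sum x u); compare with chi-square(r - k).
    quad = sum(sxu[i] * sxx_inv[i][j] * sxu[j] for i in range(2) for j in range(2))
    return beta, quad / sigma2, sxu, delta

random.seed(2)
T = 400
x1 = [random.gauss(0.0, 1.0) for _ in range(T)]
x2 = [random.gauss(0.0, 1.0) for _ in range(T)]
e = [random.gauss(0.0, 1.0) for _ in range(T)]
z = [a + 0.5 * b + c for a, b, c in zip(x1, x2, e)]
y = [2.0 * zt + 0.8 * et + random.gauss(0.0, 0.5) for zt, et in zip(z, e)]

beta_2sls, j_stat, sxu, delta = tsls_and_j(x1, x2, z, y)
```

With r − k = 1 here, `j_stat` would be compared with a χ²(1) critical value. The individual elements of `sxu` are generally nonzero, but the linear combination `delta' sxu` that defines the first-order condition is zero.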

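Alternatively, to allow for heteroskedasticity and autocorrelation for the residuals
$u_t$, the estimate $\hat{S}_T$ in [14.2.24] would be replaced by [14.2.26]. Recall the
first-order condition [14.2.27]:

$$\left\{ (1/T) \sum_{t=1}^{T} z_t x_t' \right\} \hat{S}_T^{-1} \left\{ (1/T) \sum_{t=1}^{T} x_t (y_t - z_t'\hat\beta_T) \right\} = 0.$$ [14.2.29]

If we now define

$$\tilde{z}_t \equiv \tilde\delta' x_t$$

$$\tilde\delta' \equiv \left\{ (1/T) \sum_{t=1}^{T} z_t x_t' \right\} \hat{S}_T^{-1},$$

then [14.2.29] implies that the GMM estimator for this case is given by

$$\hat\beta_T = \left\{ \sum_{t=1}^{T} \tilde{z}_t z_t' \right\}^{-1} \left\{ \sum_{t=1}^{T} \tilde{z}_t y_t \right\}.$$

This characterization of $\hat\beta_T$ is circular: in order to calculate $\hat\beta_T$, we need to know
$\tilde{z}_t$ and thus $\hat{S}_T$, whereas to construct $\hat{S}_T$ from [14.2.26] we first need to know $\hat\beta_T$.
The solution is first to estimate β using a suboptimal weighting matrix such as
$\hat{S}_T = (1/T) \sum_{t=1}^{T} x_t x_t'$, and then to use this estimate of S to reestimate β. The
asymptotic variance of the GMM estimator is given by

$$E[(\hat\beta_T - \beta)(\hat\beta_T - \beta)'] \cong T \left\{ \left[ \sum_{t=1}^{T} z_t x_t' \right] \hat{S}_T^{-1} \left[ \sum_{t=1}^{T} x_t z_t' \right] \right\}^{-1}.$$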
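
The two-step procedure that breaks the circularity can be sketched as follows for one regressor and two instruments (k = 1, r = 2). This is a simplified illustration: it assumes q = 0 in [14.2.26], so the second-step weighting matrix allows for heteroskedasticity but not autocorrelation, and all function names and simulated values are hypothetical.

```python
import random

def inv2(m):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

def quad_form(a, w, b):
    """a' W b for 2-vectors a, b and a 2 x 2 matrix W."""
    return sum(a[i] * w[i][j] * b[j] for i in range(2) for j in range(2))

def two_step_gmm(xs, z, y):
    """Two-step linear GMM: returns (first-step beta, second-step beta, variance)."""
    T = len(y)
    sxz = [sum(a[i] * zt for a, zt in zip(xs, z)) for i in range(2)]
    sxy = [sum(a[i] * yt for a, yt in zip(xs, y)) for i in range(2)]
    # Step 1: suboptimal weighting matrix S = (1/T) sum x x'.
    s0 = [[sum(a[i] * a[j] for a in xs) / T for j in range(2)] for i in range(2)]
    w0 = inv2(s0)
    b1 = quad_form(sxz, w0, sxy) / quad_form(sxz, w0, sxz)
    # Step 2: re-estimate S from first-step residuals (q = 0 case), then re-estimate beta.
    u = [yt - zt * b1 for zt, yt in zip(z, y)]
    s1 = [[sum(ut * ut * a[i] * a[j] for a, ut in zip(xs, u)) / T for j in range(2)]
          for i in range(2)]
    w1 = inv2(s1)
    b2 = quad_form(sxz, w1, sxy) / quad_form(sxz, w1, sxz)
    # Scalar version of T {[sum z x'] S^{-1} [sum x z']}^{-1}.
    var_b2 = T / quad_form(sxz, w1, sxz)
    return b1, b2, var_b2, (sxz, sxy, w1)

random.seed(3)
T = 400
xs = [(random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)) for _ in range(T)]
e = [random.gauss(0.0, 1.0) for _ in range(T)]
z = [a[0] + 0.5 * a[1] + et for a, et in zip(xs, e)]
# Residual variance depends on the first instrument (heteroskedasticity).
y = [2.0 * zt + 0.8 * et + (1.0 + abs(a[0])) * random.gauss(0.0, 0.5)
     for zt, et, a in zip(z, e, xs)]

b_first, b_second, var_second, (sxz, sxy, w1) = two_step_gmm(xs, z, y)
```

The second-step estimate satisfies the first-order condition [14.2.29] for the weighting matrix built from first-step residuals; further iterations are possible but are not shown.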


Nonlinear Systems of Simultaneous Equations 


Hansen's (1982) GMM also provides a convenient framework for estimating
the nonlinear systems of simultaneous equations analyzed by Amemiya (1974),
Jorgenson and Laffont (1974), and Gallant (1977). Suppose that the goal is to
estimate a system of n nonlinear equations of the form

$$y_t = f(\theta, z_t) + u_t$$

for $z_t$ a (k × 1) vector of explanatory variables and θ an (a × 1) vector of unknown
parameters. Let $x_{it}$ denote a vector of instruments that are uncorrelated with the
ith element of $u_t$. The r orthogonality conditions for this model are

$$h(\theta, w_t) = \begin{bmatrix} [y_{1t} - f_1(\theta, z_t)] x_{1t} \\ [y_{2t} - f_2(\theta, z_t)] x_{2t} \\ \vdots \\ [y_{nt} - f_n(\theta, z_t)] x_{nt} \end{bmatrix},$$

where $f_i(\theta, z_t)$ denotes the ith element of $f(\theta, z_t)$ and $w_t = (y_t', z_t', x_t')'$. The
GMM estimate of θ is the value that minimizes

$$Q(\theta; \mathcal{Y}_T) = \left[ (1/T) \sum_{t=1}^{T} h(\theta, w_t) \right]' \hat{S}_T^{-1} \left[ (1/T) \sum_{t=1}^{T} h(\theta, w_t) \right],$$ [14.2.30]

where an estimate of S that could be used with heteroskedasticity and serial correlation
of $u_t$ is given by

$$\hat{S}_T = \hat\Gamma_{0,T} + \sum_{v=1}^{q} \{1 - [v/(q+1)]\} (\hat\Gamma_{v,T} + \hat\Gamma_{v,T}')$$

$$\hat\Gamma_{v,T} = (1/T) \sum_{t=v+1}^{T} [h(\hat\theta, w_t)][h(\hat\theta, w_{t-v})]'.$$

Minimization of [14.2.30] can be achieved numerically. Again, in order to evaluate
[14.2.30], we first need an initial estimate of S. One approach is to first minimize
[14.2.30] with $\hat{S}_T = I_r$, use the resulting estimate $\hat\theta$ to construct a better estimate
of $\hat{S}_T$, and recalculate $\hat\theta$; the procedure can be iterated further, if desired. Identification
requires an order condition (r ≥ a) and the rank condition that the columns
of the plim of $\hat{D}_T'$ be linearly independent, where

$$\hat{D}_T' = (1/T) \sum_{t=1}^{T} \left.\frac{\partial h(\theta, w_t)}{\partial \theta'}\right|_{\theta = \hat\theta_T}.$$

Standard errors for $\hat\theta_T$ are then readily calculated from [14.2.5] and [14.2.6].


Estimation of Dynamic Rational Expectation Models 


People’s behavior is often influenced by their expectations about the future. 
Unfortunately, we typically do not have direct observations on these expectations. 
However, it is still possible to estimate and test behavioral models if people’s 
expectations are formed rationally in the sense that the errors they make in fore- 
casting are uncorrelated with information they had available at the time of the 
forecast. As long as the econometrician observes a subset of the information people 
have actually used, the rational expectations hypothesis suggests orthogonality 
conditions that can be used in the GMM framework. 

For illustration, we consider the study of portfolio decisions by Hansen and
Singleton (1982). Let $c_t$ denote the overall level of spending on consumption goods
by a particular stockholder during period t. The satisfaction or utility that the
stockholder receives from this spending is represented by a function $u(c_t)$, where
it is assumed that

$$\frac{\partial u(c_t)}{\partial c_t} > 0, \qquad \frac{\partial^2 u(c_t)}{\partial c_t^2} < 0.$$

The stockholder is presumed to want to maximize

$$\sum_{\tau=0}^{\infty} \beta^{\tau} E\{u(c_{t+\tau}) \mid x_t^*\},$$ [14.2.31]

where $x_t^*$ is a vector representing all the information available to the stockholder
at date t and β is a parameter satisfying 0 < β < 1. Smaller values of β mean that
the stockholder places a smaller weight on future events. At date t, the stockholder
contemplates purchasing any of m different assets, where a dollar invested in asset
i at date t will yield a gross return of $(1 + r_{i,t+1})$ at date t + 1; in general this rate
of return is not known for certain at date t. Assuming that the stockholder takes
a position in each of these m assets, the stockholder's optimal portfolio will satisfy

$$u'(c_t) = \beta E\{(1 + r_{i,t+1}) u'(c_{t+1}) \mid x_t^*\} \quad \text{for } i = 1, 2, \ldots, m,$$ [14.2.32]

where $u'(c_t) \equiv \partial u / \partial c_t$. To see the intuition behind this claim, suppose that condition
[14.2.32] failed to hold. Say, for example, that the left side were smaller than the
right. Suppose the stockholder were to save one more dollar at date t and invest
the dollar in asset i, using the returns to boost period t + 1 consumption. Following
this strategy would cause consumption at date t to fall by one dollar (reducing
[14.2.31] by an amount given by the left side of [14.2.32]), while consumption at
date t + 1 would rise by $(1 + r_{i,t+1})$ dollars (increasing [14.2.31] by an amount
given by the right side of [14.2.32]). If the left side of [14.2.32] were less than the
right side of [14.2.32], then the stockholder's objective [14.2.31] would be improved
under this change. Only when [14.2.32] is satisfied is the stockholder as well off
as possible.²
Suppose that the utility function is parameterized as

$$u(c_t) = \begin{cases} \dfrac{c_t^{1-\gamma}}{1-\gamma} & \text{for } \gamma > 0 \text{ and } \gamma \neq 1 \\ \log c_t & \text{for } \gamma = 1. \end{cases}$$

The parameter γ is known as the coefficient of relative risk aversion, which for this
class of utility functions is a constant. For this function, [14.2.32] becomes

$$c_t^{-\gamma} = \beta E\{(1 + r_{i,t+1}) c_{t+1}^{-\gamma} \mid x_t^*\}.$$ [14.2.33]

Dividing both sides of [14.2.33] by $c_t^{-\gamma}$ results in

$$1 = \beta E\{(1 + r_{i,t+1})(c_{t+1}/c_t)^{-\gamma} \mid x_t^*\},$$ [14.2.34]

where $c_t$ could be moved inside the conditional expectation operator, since it represents
a decision based solely on the information contained in $x_t^*$. Expression
[14.2.34] requires that the random variable described by

$$1 - \beta\{(1 + r_{i,t+1})(c_{t+1}/c_t)^{-\gamma}\}$$ [14.2.35]

be uncorrelated with any variable contained in the information set $x_t^*$ for any asset
i that the stockholder holds. It should therefore be the case that

$$E\{[1 - \beta\{(1 + r_{i,t+1})(c_{t+1}/c_t)^{-\gamma}\}] x_t\} = 0,$$ [14.2.36]

where $x_t$ is any subset of the stockholder's information set $x_t^*$ that the econometrician
is also able to observe.

Let θ = (β, γ)' denote the unknown parameters that are to be estimated,
and let $w_t = (r_{1,t+1}, r_{2,t+1}, \ldots, r_{m,t+1}, c_{t+1}/c_t, x_t')'$ denote the vector of variables
that are observed by the econometrician for date t. Stacking the equations in
[14.2.36] for i = 1, 2, . . . , m produces a set of r orthogonality conditions that
can be used to estimate θ:

$$\underset{(r \times 1)}{h(\theta, w_t)} = \begin{bmatrix} [1 - \beta\{(1 + r_{1,t+1})(c_{t+1}/c_t)^{-\gamma}\}] x_t \\ [1 - \beta\{(1 + r_{2,t+1})(c_{t+1}/c_t)^{-\gamma}\}] x_t \\ \vdots \\ [1 - \beta\{(1 + r_{m,t+1})(c_{t+1}/c_t)^{-\gamma}\}] x_t \end{bmatrix}.$$ [14.2.37]

The sample average value of $h(\theta, w_t)$ is

$$g(\theta; \mathcal{Y}_T) \equiv (1/T) \sum_{t=1}^{T} h(\theta, w_t),$$

and the GMM objective function is

$$Q(\theta) = [g(\theta; \mathcal{Y}_T)]' \hat{S}_T^{-1} [g(\theta; \mathcal{Y}_T)].$$ [14.2.38]

This expression can then be minimized numerically with respect to θ.
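
The moment vector and objective function for the consumption-based model can be sketched for a single asset (m = 1) with an identity weighting matrix, which corresponds to a first-step objective. The data below are synthetic and constructed so that the Euler equation residual is exactly zero at the chosen parameter values; the function names, the instrument choice (a constant and one lag of consumption growth), and all numbers are assumptions for illustration and are unrelated to Hansen and Singleton's data.

```python
def euler_moments(theta, returns, growth, instruments):
    """Sample average of h(theta, w_t) for one asset.

    returns[t] is r_{t+1}, growth[t] is c_{t+1}/c_t, and instruments[t] is
    the instrument vector x_t (a list) dated t.
    """
    beta, gamma = theta
    T = len(returns)
    r = len(instruments[0])
    g = [0.0] * r
    for ret, gr, x in zip(returns, growth, instruments):
        # Euler equation residual: 1 - beta (1 + r_{t+1}) (c_{t+1}/c_t)^(-gamma)
        resid = 1.0 - beta * (1.0 + ret) * gr ** (-gamma)
        for j in range(r):
            g[j] += resid * x[j] / T
    return g

def gmm_objective(theta, returns, growth, instruments):
    """Q(theta) = g' g, that is, the objective with the identity weighting matrix."""
    g = euler_moments(theta, returns, growth, instruments)
    return sum(v * v for v in g)

# Synthetic consumption growth; returns chosen so the residual vanishes at true_theta.
growth_all = [1.02, 1.01, 1.03, 0.99, 1.02, 1.00, 1.04]
true_theta = (0.95, 2.0)
growth = growth_all[1:]                                  # c_{t+1}/c_t
instruments = [[1.0, g_lag] for g_lag in growth_all[:-1]]  # (1, c_t/c_{t-1})
returns = [gr ** true_theta[1] / true_theta[0] - 1.0 for gr in growth]

q_true = gmm_objective(true_theta, returns, growth, instruments)
q_other = gmm_objective((0.90, 2.0), returns, growth, instruments)
```

In practice the objective would be passed to a numerical minimizer over (β, γ); here it is only evaluated at two parameter values to show that it is zero where the orthogonality conditions hold exactly and positive elsewhere.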
According to the theory, the magnitude in [14.2.35] should be uncorrelated
with any information the stockholder has available at time t, which would include
lagged values of [14.2.35]. Hence, the vector in [14.2.37] should be uncorrelated
with its own lagged values, suggesting that S can be consistently estimated by

$$\hat{S}_T = (1/T) \sum_{t=1}^{T} \{[h(\hat\theta, w_t)][h(\hat\theta, w_t)]'\},$$

where $\hat\theta$ is an initial consistent estimate. This initial estimate $\hat\theta$ could be obtained
by minimizing [14.2.38] with $\hat{S}_T = I_r$.

²For further details, see Sargent (1987).

Hansen and Singleton (1982) estimated such a model using real consumption
expenditures for the aggregate United States divided by the U.S. population as
their measure of $c_t$. For $r_{1t}$, they used the inflation-adjusted return that an investor
would earn if one dollar was invested in every stock listed on the New York Stock
Exchange, while $r_{2t}$ was a value-weighted inflation-adjusted return corresponding
to the return an investor would earn if the investor owned the entire stock of each
company listed on the exchange. Hansen and Singleton's instruments consisted of
a constant term, lagged consumption growth rates, and lagged rates of return:

$$x_t = (1, c_t/c_{t-1}, c_{t-1}/c_{t-2}, \ldots, c_{t-\ell+1}/c_{t-\ell}, r_{1t}, r_{1,t-1}, \ldots, r_{1,t-\ell+1}, r_{2t}, r_{2,t-1}, \ldots, r_{2,t-\ell+1})'.$$

When ℓ lags are used, there are 3ℓ + 1 elements in $x_t$, and thus r = 2(3ℓ + 1)
separate orthogonality conditions are represented by [14.2.37]. Since a = 2 parameters
are estimated, the χ² statistic in [14.1.27] has 6ℓ degrees of freedom.


14.3. Extensions 


GMM with Nonstationary Data 


The maintained assumption throughout this chapter has been that the (h × 1)
vector of observed variables $w_t$ is strictly stationary. Even if the raw data appear
to be trending over time, sometimes the model can be transformed or reparameterized
so that stationarity of the transformed system is a reasonable assumption.
For example, the consumption series $\{c_t\}$ used in Hansen and Singleton's study
(1982) is increasing over time. However, it was possible to write the equation to be
estimated [14.2.36] in such a form that only the consumption growth rate $(c_{t+1}/c_t)$
appeared, for which the stationarity assumption is much more plausible. Alternatively,
suppose that some of the elements of the observed vector $w_t$ are presumed
to grow deterministically over time according to

$$w_t = \alpha + \delta \cdot t + w_t^*,$$ [14.3.1]

where α and δ are (h × 1) vectors of constants and $w_t^*$ is strictly stationary with
mean zero. Suppose that the orthogonality conditions can be expressed in terms
of $w_t^*$ as

$$E\{f(\theta_0, w_t^*)\} = 0.$$

Then Ogaki (1993) recommended jointly estimating θ, α, and δ using

$$h(\theta, w_t) = \begin{bmatrix} w_t - \alpha - \delta t \\ f(\theta, w_t - \alpha - \delta t) \end{bmatrix}$$

to construct the moment condition in [14.2.3].


Testing for Structural Stability 


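Suppose we want to test the hypothesis that the (a × 1) parameter vector θ
that characterizes the first $T_0$ observations in the sample is different from the value
that characterizes the last $T - T_0$ observations, where $T_0$ is a known change point.
One approach is to obtain an estimate $\hat\theta_{1,T_0}$ based solely on the first $T_0$ observations,
minimizing

$$Q(\theta_1; w_{T_0}, w_{T_0-1}, \ldots, w_1) = \left[ (1/T_0) \sum_{t=1}^{T_0} h(\theta_1, w_t) \right]' \hat{S}_{1,T_0}^{-1} \left[ (1/T_0) \sum_{t=1}^{T_0} h(\theta_1, w_t) \right],$$ [14.3.2]

where, for example, if $\{h(\theta_0, w_t)\}$ is serially uncorrelated,

$$\hat{S}_{1,T_0} = (1/T_0) \sum_{t=1}^{T_0} [h(\hat\theta_{1,T_0}, w_t)][h(\hat\theta_{1,T_0}, w_t)]'.$$

Proposition 14.1 implies that

$$\sqrt{T_0}(\hat\theta_{1,T_0} - \theta_1) \xrightarrow{L} N(0, V_1)$$ [14.3.3]

as $T_0 \to \infty$, where $V_1$ can be estimated from

$$\hat{V}_{1,T_0} = \{\hat{D}_{1,T_0} \hat{S}_{1,T_0}^{-1} \hat{D}_{1,T_0}'\}^{-1}$$

for

$$\hat{D}_{1,T_0}' = (1/T_0) \sum_{t=1}^{T_0} \left.\frac{\partial h(\theta_1, w_t)}{\partial \theta_1'}\right|_{\theta_1 = \hat\theta_{1,T_0}}.$$

Similarly, a separate estimate $\hat\theta_{2,T-T_0}$ can be based on the last $T - T_0$ observations,
with analogous measures $\hat{S}_{2,T-T_0}$, $\hat{V}_{2,T-T_0}$, $\hat{D}_{2,T-T_0}$, and

$$\sqrt{T - T_0}(\hat\theta_{2,T-T_0} - \theta_2) \xrightarrow{L} N(0, V_2)$$ [14.3.4]

as $T - T_0 \to \infty$. Let $\pi \equiv T_0/T$ denote the fraction of observations contained in
the first subsample. Then [14.3.3] and [14.3.4] state that

$$\sqrt{T}(\hat\theta_{1,T_0} - \theta_1) \xrightarrow{L} N(0, V_1/\pi)$$

$$\sqrt{T}(\hat\theta_{2,T-T_0} - \theta_2) \xrightarrow{L} N(0, V_2/(1 - \pi))$$

as $T \to \infty$. Andrews and Fair (1988) suggested using a Wald test of the null
hypothesis that $\theta_1 = \theta_2$, exploiting the fact that under the stationarity conditions
needed to justify Proposition 14.1, $\hat\theta_1$ is asymptotically independent of $\hat\theta_2$:

$$\lambda_T = T(\hat\theta_{1,T_0} - \hat\theta_{2,T-T_0})' \times \{\pi^{-1} \hat{V}_{1,T_0} + (1 - \pi)^{-1} \hat{V}_{2,T-T_0}\}^{-1} (\hat\theta_{1,T_0} - \hat\theta_{2,T-T_0}).$$

Then $\lambda_T \xrightarrow{L} \chi^2(a)$ under the null hypothesis that $\theta_1 = \theta_2$.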
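
The Wald statistic for a break at a known date can be illustrated in the simplest setting, where the only parameter is a mean (a = 1) and the moment function is $h(\theta, w_t) = w_t - \theta$. In that case each subsample estimate is a sample mean and, assuming serially uncorrelated data, each variance estimate is a subsample variance. The function names and data are illustrative only.

```python
def mean_and_variance(w):
    """GMM estimate of a mean and the implied V-hat (the sample variance)."""
    n = len(w)
    mu = sum(w) / n
    return mu, sum((v - mu) ** 2 for v in w) / n

def wald_break_statistic(w, t0):
    """Scalar version of lambda_T for a known change point after observation t0."""
    T = len(w)
    pi = t0 / T
    theta1, v1 = mean_and_variance(w[:t0])
    theta2, v2 = mean_and_variance(w[t0:])
    # lambda_T = T (theta1 - theta2)^2 / (V1 / pi + V2 / (1 - pi))
    return T * (theta1 - theta2) ** 2 / (v1 / pi + v2 / (1.0 - pi))

stable = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]                # identical halves
shifted = [0.0, 1.0, 0.0, 1.0, 5.0, 6.0, 5.0, 6.0]     # mean shifts from 0.5 to 5.5

lam_stable = wald_break_statistic(stable, 3)
lam_shifted = wald_break_statistic(shifted, 4)
```

For `shifted`, both subsample variances are 0.25 and π = 0.5, so the statistic is 8 × 25 / (0.5 + 0.5) = 200, far above any conventional χ²(1) critical value; for `stable` it is zero.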

One can further test for structural change at a variety of different possible
dates, repeating the foregoing test for all $T_0$ between, say, 0.15T and 0.85T and
choosing the largest value for the resulting test statistic $\lambda_T$. Andrews (1993) described
the asymptotic distribution of such a test.

Another simple test associates separate moment conditions with the observations
before and after $T_0$ and uses the χ² test suggested in [14.1.27] to test the
validity of the separate sets of conditions. Specifically, let

$$d_{1t} = \begin{cases} 1 & \text{for } t \leq T_0 \\ 0 & \text{for } t > T_0. \end{cases}$$

If $h(\theta, w_t)$ is an (r × 1) vector whose population mean is zero at $\theta_0$, define

$$\underset{(2r \times 1)}{h^*(\theta, w_t, d_{1t})} = \begin{bmatrix} h(\theta, w_t) \cdot d_{1t} \\ h(\theta, w_t) \cdot (1 - d_{1t}) \end{bmatrix}.$$

The a elements of θ can then be estimated by using the 2r orthogonality conditions
given by $E\{h^*(\theta_0, w_t, d_{1t})\} = 0$ for t = 1, 2, . . . , T, by simply replacing $h(\theta, w_t)$
in [14.2.3] with $h^*(\theta, w_t, d_{1t})$ and minimizing [14.2.2] in the usual way. Hansen's
χ² test statistic described in [14.1.27] based on the h*(·) moment conditions could
then be compared with a χ²(2r − a) critical value to provide a test of the hypothesis
that $\theta_1 = \theta_2$.

A number of other tests for structural change have been proposed by Andrews
and Fair (1988) and Ghysels and Hall (1990a, b).


GMM and Econometric Identification 


For the portfolio decision model [14.2.34], it was argued that any variable
would be valid to include in the instrument vector $x_t$, as long as that variable was
known to investors at date t and their expectations were formed rationally. Essentially,
[14.2.34] represents an asset demand curve. In the light of the discussion of
simultaneous equations bias in Section 9.1, one might be troubled by the claim
that it is possible to estimate a demand curve without needing to think about the
way that variables may affect the demand and supply of assets in different ways.

As stressed by Garber and King (1984), the portfolio choice model avoids
simultaneous equations bias because it postulates that equation [14.2.32] holds
exactly, with no error term. The model as written claims that if the econometrician
had the same information $x_t^*$ used by investors, then investors' behavior could be
predicted with an R² of unity. If there were no error term in the demand for oranges
equation [9.1.1], or if the error in the demand for oranges equation were negligible
compared with the error term in the supply equation, then we would not have had
to worry about simultaneous equations bias in that example, either.

It is hard to take seriously the suggestion that the observed data are exactly
described by [14.2.32] with no error. There are substantial difficulties in measuring
aggregate consumption, population, and rates of return on assets. Even if these
aggregates could in some sense be measured perfectly, it is questionable whether
they are the appropriate values to be using to test a theory about individual investor
preferences. And even if we had available a perfect measure of the consumption
of an individual investor, the notion that the investor's utility could be represented
by a function of this precise parametric form with γ constant across time is surely
hard to defend.

Once we acknowledge that an error term reasonably ought to be included in
[14.2.32], then it is no longer satisfactory to say that any variable dated t or earlier
is a valid instrument. The difficulties with estimation are compounded by the
nonlinearity of the equations of interest. If one wants to take seriously the possibility
of an error term in [14.2.32] and its correlation with other variables, the best
approach currently available appears to be to linearize the dynamic rational expectations
model. Any variable uncorrelated with both the forecast error people
make and the specification error in the model could then be used as a valid instrument
for traditional instrumental variable estimation; see Sill (1992) for an
illustration of this approach.


Optimal Choice of Instruments 


If one does subscribe to the view that any variable dated t or earlier is a valid
instrument for estimation of [14.2.32], this suggests a virtually infinite set of possible
variables that could be used. One's first thought might be that, the more orthogonality
conditions used, the better the resulting estimates might be. However,
Monte Carlo simulations by Tauchen (1986) and Kocherlakota (1990) strongly
suggest that one should be quite parsimonious in the selection of $x_t$. Nelson and
Startz (1990) in particular stress that in the linear simultaneous equations model
$y_t = z_t'\beta + u_t$, a good instrument not only must be uncorrelated with $u_t$, but must
also be strongly correlated with $z_t$. See Bates and White (1988), Hall (1993), and
Gallant and Tauchen (1992) for further discussion on instrument selection.


14.4. GMM and Maximum Likelihood Estimation 


In many cases the maximum likelihood estimate of θ can also be viewed as a GMM
estimate. This section explores this analogy and shows how asymptotic properties
of maximum likelihood estimation and quasi-maximum likelihood can be obtained
from the previous general results about GMM estimation.


The Score and Its Population Properties 


Let $y_t$ denote an (n × 1) vector of variables observed at date t, and let
$\mathcal{Y}_t \equiv (y_t', y_{t-1}', \ldots, y_1')'$ denote the full set of data observed through date t. Suppose
that the conditional density of the tth observation is given by

$$f(y_t \mid \mathcal{Y}_{t-1}; \theta).$$ [14.4.1]

Since [14.4.1] is a density, it must integrate to unity:

$$\int_{\mathcal{A}} f(y_t \mid \mathcal{Y}_{t-1}; \theta) \, dy_t = 1,$$ [14.4.2]

where $\mathcal{A}$ denotes the set of possible values that $y_t$ could take on and $\int dy_t$ denotes
multiple integration:

$$\int h(y_t) \, dy_t = \int \int \cdots \int h(y_{1t}, y_{2t}, \ldots, y_{nt}) \, dy_{1t} \, dy_{2t} \cdots dy_{nt}.$$

Since [14.4.2] holds for all admissible values of θ, we can differentiate both sides
with respect to θ to conclude that

$$\int_{\mathcal{A}} \frac{\partial f(y_t \mid \mathcal{Y}_{t-1}; \theta)}{\partial \theta} \, dy_t = 0.$$ [14.4.3]

The conditions under which the order of differentiation and integration can be
reversed as assumed in arriving at [14.4.3] and the equations to follow are known
as "regularity conditions" and are detailed in Cramér (1946). Assuming that these
hold, we can multiply and divide the integrand in [14.4.3] by the conditional density
of $y_t$:

$$\int_{\mathcal{A}} \frac{\partial f(y_t \mid \mathcal{Y}_{t-1}; \theta)}{\partial \theta} \frac{1}{f(y_t \mid \mathcal{Y}_{t-1}; \theta)} f(y_t \mid \mathcal{Y}_{t-1}; \theta) \, dy_t = 0,$$

or

$$\int_{\mathcal{A}} \frac{\partial \log f(y_t \mid \mathcal{Y}_{t-1}; \theta)}{\partial \theta} f(y_t \mid \mathcal{Y}_{t-1}; \theta) \, dy_t = 0.$$ [14.4.4]

Let $h(\theta, \mathcal{Y}_t)$ denote the derivative of the log of the conditional density of the
tth observation:

$$h(\theta, \mathcal{Y}_t) \equiv \frac{\partial \log f(y_t \mid \mathcal{Y}_{t-1}; \theta)}{\partial \theta}.$$ [14.4.5]

If there are a elements in θ, then [14.4.5] describes an (a × 1) vector for each
date t that is known as the score of the tth observation. Since the score is a function
of $\mathcal{Y}_t$, it is a random variable. Moreover, substitution of [14.4.5] into [14.4.4] reveals
that

$$\int_{\mathcal{A}} h(\theta, \mathcal{Y}_t) f(y_t \mid \mathcal{Y}_{t-1}; \theta) \, dy_t = 0.$$ [14.4.6]

Equation [14.4.6] indicates that if the data were really generated by the density
[14.4.1], then the expected value of the score conditional on information observed
through date t − 1 should be zero:

$$E\{h(\theta, \mathcal{Y}_t) \mid \mathcal{Y}_{t-1}\} = 0.$$ [14.4.7]

In other words, the score vectors $\{h(\theta, \mathcal{Y}_t)\}_{t=1}^{\infty}$ should form a martingale difference
sequence. This observation prompted White (1987) to suggest a general specification
test for models estimated by maximum likelihood based on whether the
sample scores appear to be serially correlated. Expression [14.4.7] further implies
that the score has unconditional expectation of zero, provided that the unconditional
first moment exists:

$$E\{h(\theta, \mathcal{Y}_t)\} = 0.$$ [14.4.8]


Maximum Likelihood and GMM 


Expression [14.4.8] can be viewed as a set of a orthogonality conditions that
could be used to estimate the a unknown elements of θ. The GMM principle
suggests using as an estimate of θ the solution to

$$0 = (1/T) \sum_{t=1}^{T} h(\theta, \mathcal{Y}_t).$$ [14.4.9]

But this is also the characterization of the maximum likelihood estimate, which is
based on maximization of

$$\mathcal{L}(\theta) = \sum_{t=1}^{T} \log f(y_t \mid \mathcal{Y}_{t-1}; \theta),$$

the first-order conditions for which are

$$\sum_{t=1}^{T} \frac{\partial \log f(y_t \mid \mathcal{Y}_{t-1}; \theta)}{\partial \theta} = 0,$$ [14.4.10]

assuming an interior maximum. Recalling [14.4.5], observe that [14.4.10] and [14.4.9]
are identical conditions: the MLE is the same as the GMM estimator based on
the orthogonality conditions in [14.4.8].

The GMM formula [14.2.6] suggests that the variance-covariance matrix of
the MLE can be approximated by

$$E[(\hat\theta_T - \theta_0)(\hat\theta_T - \theta_0)'] \cong (1/T)\{\hat{D}_T \hat{S}_T^{-1} \hat{D}_T'\}^{-1},$$ [14.4.11]

where

$$\hat{D}_T' = \left.\frac{\partial g(\theta; \mathcal{Y}_T)}{\partial \theta'}\right|_{\theta = \hat\theta_T} = (1/T) \sum_{t=1}^{T} \left.\frac{\partial h(\theta, \mathcal{Y}_t)}{\partial \theta'}\right|_{\theta = \hat\theta_T} = (1/T) \sum_{t=1}^{T} \left.\frac{\partial^2 \log f(y_t \mid \mathcal{Y}_{t-1}; \theta)}{\partial \theta \, \partial \theta'}\right|_{\theta = \hat\theta_T}.$$ [14.4.12]

Moreover, the observation in [14.4.7] that the scores are serially uncorrelated
suggests estimating S by

$$\hat{S}_T = (1/T) \sum_{t=1}^{T} [h(\hat\theta, \mathcal{Y}_t)][h(\hat\theta, \mathcal{Y}_t)]'.$$ [14.4.13]


The Information Matrix Equality 


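Expression [14.4.12] will be recognized as −1 times the second derivative
estimate of the information matrix. Similarly, expression [14.4.13] is the outer-product
estimate of the information matrix. That these two expressions are indeed
estimating the same matrix if the model is correctly specified can be seen from
calculations similar to those that produced [14.4.6]. Differentiating both sides of
[14.4.6] with respect to θ' reveals that

$$0 = \int_{\mathcal{A}} \frac{\partial h(\theta, \mathcal{Y}_t)}{\partial \theta'} f(y_t \mid \mathcal{Y}_{t-1}; \theta) \, dy_t + \int_{\mathcal{A}} h(\theta, \mathcal{Y}_t) \frac{\partial f(y_t \mid \mathcal{Y}_{t-1}; \theta)}{\partial \theta'} \, dy_t$$

$$= \int_{\mathcal{A}} \frac{\partial h(\theta, \mathcal{Y}_t)}{\partial \theta'} f(y_t \mid \mathcal{Y}_{t-1}; \theta) \, dy_t + \int_{\mathcal{A}} h(\theta, \mathcal{Y}_t) \frac{\partial \log f(y_t \mid \mathcal{Y}_{t-1}; \theta)}{\partial \theta'} f(y_t \mid \mathcal{Y}_{t-1}; \theta) \, dy_t,$$

or

$$\int_{\mathcal{A}} [h(\theta, \mathcal{Y}_t)][h(\theta, \mathcal{Y}_t)]' f(y_t \mid \mathcal{Y}_{t-1}; \theta) \, dy_t = -\int_{\mathcal{A}} \frac{\partial h(\theta, \mathcal{Y}_t)}{\partial \theta'} f(y_t \mid \mathcal{Y}_{t-1}; \theta) \, dy_t.$$

This equation implies that if the model is correctly specified, the expected value
of the outer product of the vector of first derivatives of the log likelihood is equal
to the negative of the expected value of the matrix of second derivatives:

$$E\left\{ \left[ \frac{\partial \log f(y_t \mid \mathcal{Y}_{t-1}; \theta)}{\partial \theta} \right] \left[ \frac{\partial \log f(y_t \mid \mathcal{Y}_{t-1}; \theta)}{\partial \theta'} \right] \Big| \mathcal{Y}_{t-1} \right\} = -E\left\{ \frac{\partial^2 \log f(y_t \mid \mathcal{Y}_{t-1}; \theta)}{\partial \theta \, \partial \theta'} \Big| \mathcal{Y}_{t-1} \right\} \equiv \mathcal{J}_t.$$ [14.4.14]

Expression [14.4.14] is known as the information matrix equality. Assuming that
$(1/T) \sum_{t=1}^{T} \mathcal{J}_t \xrightarrow{p} \mathcal{J}$, a positive definite matrix, we can reasonably expect that for
many models, the estimate $\hat{S}_T$ in [14.4.13] converges in probability to the information
matrix $\mathcal{J}$ and the estimate $\hat{D}_T'$ in [14.4.12] converges in probability to
$-\mathcal{J}$. Thus, result [14.4.11] suggests that if the data are stationary and the estimates
do not fall on the boundaries of the allowable parameter space, it will often be the
case that

$$\sqrt{T}(\hat\theta_T - \theta_0) \xrightarrow{L} N(0, \mathcal{J}^{-1}),$$ [14.4.15]

where the information matrix $\mathcal{J}$ can be estimated consistently from either $-\hat{D}_T'$ in
[14.4.12] or $\hat{S}_T$ in [14.4.13].

In small samples, the estimates $-\hat{D}_T'$ and $\hat{S}_T$ will differ, though if they differ
too greatly this suggests that the model may be misspecified. White (1982) developed
an alternative specification test based on comparing these two magnitudes.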


The Wald Test for Maximum Likelihood Estimates 


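Result [14.4.15] suggests a general approach to testing hypotheses about the
value of a parameter vector θ that has been estimated by maximum likelihood.
Consider a null hypothesis involving m restrictions on θ represented as g(θ) = 0,
where $g: \mathbb{R}^a \to \mathbb{R}^m$ is a known differentiable function. The Wald test of this hypothesis
is given by

$$T \underset{(1 \times m)}{[g(\hat\theta_T)]'} \left\{ \underset{(m \times a)}{\left[ \left.\frac{\partial g(\theta)}{\partial \theta'}\right|_{\theta = \hat\theta_T} \right]} \underset{(a \times a)}{\hat{\mathcal{J}}_T^{-1}} \underset{(a \times m)}{\left[ \left.\frac{\partial g(\theta)}{\partial \theta'}\right|_{\theta = \hat\theta_T} \right]'} \right\}^{-1} \underset{(m \times 1)}{[g(\hat\theta_T)]},$$ [14.4.16]

which converges in distribution to a χ²(m) variable under the null hypothesis.
Again, the estimate of the information matrix $\hat{\mathcal{J}}_T$ could be based on either $-\hat{D}_T'$
in [14.4.12] or $\hat{S}_T$ in [14.4.13].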


The Lagrange Multiplier Test 


We have seen that if the model is correctly specified, the scores
$\{h(\theta_0, \mathcal{Y}_t)\}_{t=1}^{\infty}$ often form a martingale difference sequence. Expression [14.4.14]
indicates that the conditional variance-covariance matrix of the tth score is given
by $\mathcal{J}_t$. Hence, typically,

$$T \left[ (1/T) \sum_{t=1}^{T} h(\theta_0, \mathcal{Y}_t) \right]' \hat{\mathcal{J}}_T^{-1} \left[ (1/T) \sum_{t=1}^{T} h(\theta_0, \mathcal{Y}_t) \right] \xrightarrow{L} \chi^2(a).$$ [14.4.17]

Expression [14.4.17] does not hold when $\theta_0$ is replaced by $\hat\theta_T$, since, from [14.4.9],
this would cause [14.4.17] to be identically zero.

Suppose, however, that the likelihood function is maximized subject to m
constraints on θ, and let $\tilde\theta_T$ denote the restricted estimate of θ. Then, as in the
GMM test for overidentifying restrictions [14.1.27], we would expect that

$$T \left[ (1/T) \sum_{t=1}^{T} h(\tilde\theta_T, \mathcal{Y}_t) \right]' \hat{\mathcal{J}}_T^{-1} \left[ (1/T) \sum_{t=1}^{T} h(\tilde\theta_T, \mathcal{Y}_t) \right] \xrightarrow{L} \chi^2(m).$$ [14.4.18]

The magnitude in [14.4.18] was called the efficient score statistic by Rao (1948)
and the Lagrange multiplier test by Aitchison and Silvey (1958). It provides an
extremely useful class of diagnostic tests, enabling one to estimate a restricted
model and test it against a more general specification without having to estimate
the more general model. Breusch and Pagan (1980), Engle (1984), and Godfrey
(1988) illustrated applications of the Lagrange multiplier principle.


Quasi-Maximum Likelihood Estimation 


Even if the data were not generated by the density f(y,|%,_,; @), the or- 
thogonality conditions [14.4.8] might still provide a useful description of the pa- 
rameter vector of interest. For example, suppose that we incorrectly specified that 
a scalar series y, came from a Gaussian AR(1) process: 


log f(y,|¥,- 1; 8) = —3 log(2) ~ 2 log(o) ~ (y, — dy,-1)*/(20°), 
with @ = (, a7)’. The score vector is then 


(Y: ~ bY1-DY-1/0? 
h(6, Y,) = Be + (y,- ee 


which has expectation zero whenever 


El(y, — $y.-1)y,-1] = 0 [14.4.19] 
E[(y ~ $y,-1)"] = 0. [14.4.20] 


430 Chapter 14 | Generalized Method of Moments 


The value of the parameter $\phi$ that satisfies [14.4.19] corresponds to the coefficient
of a linear projection of $y_t$ on $y_{t-1}$ regardless of the time series process followed
by $y_t$, while $\sigma^2$ in [14.4.20] is a general characterization of the mean squared error
of this linear projection. Hence, the moment conditions in [14.4.8] hold for a broad
class of possible processes, and the estimates obtained by maximizing a Gaussian
likelihood function (that is, the values satisfying [14.4.9]) should give reasonable
estimates of the linear projection coefficient and its mean squared error for a fairly
general class of possible data-generating mechanisms.

However, if the data were not generated by a Gaussian AR(1) process, then
the information matrix equality no longer need hold. As long as the score vector
is serially uncorrelated, the variance-covariance matrix of the resulting estimates
could be obtained from [14.4.11]. Proceeding in this fashion (maximizing the
likelihood function in the usual way, but using [14.4.11] rather than [14.4.15] to
calculate standard errors) was first proposed by White (1982), who described this
approach as quasi-maximum likelihood estimation.³
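The difference between the two ways of computing standard errors can be sketched for the AR(1) example. The code below assumes that [14.4.11] takes the sandwich form $(1/T)\{\hat{D}\hat{S}^{-1}\hat{D}'\}^{-1}$ suggested by Proposition 14.1, with $\hat{S}$ the average outer product of the scores and $\hat{D}'$ the average matrix of second derivatives; the function and key names are hypothetical. At the estimates the cross-derivatives average exactly to zero, so only scalar arithmetic is needed for the standard error of $\hat{\phi}$.

```python
import math


def qmle_ar1(y):
    """Gaussian quasi-MLE of phi in y_t = phi*y_{t-1} + e_t, with an
    information-matrix standard error and an (assumed) sandwich one."""
    cur, lag = y[1:], y[:-1]
    T = len(cur)
    phi = sum(a * b for a, b in zip(cur, lag)) / sum(b * b for b in lag)
    resid = [a - phi * b for a, b in zip(cur, lag)]
    sigma2 = sum(e * e for e in resid) / T

    # (phi, phi) element of S: average squared score e_t * y_{t-1} / sigma2.
    s11 = sum((e * b / sigma2) ** 2 for e, b in zip(resid, lag)) / T
    # (phi, phi) element of D': average second derivative, -y_{t-1}^2 / sigma2.
    d11 = -sum(b * b for b in lag) / (T * sigma2)

    # D' is diagonal at the estimates, so the phi-element of
    # {D S^{-1} D'}^{-1} reduces to s11 / d11^2.
    var_qml = s11 / d11 ** 2 / T
    var_ml = 1.0 / (-d11) / T     # relies on the information matrix equality
    return {"phi": phi, "sigma2": sigma2,
            "se_phi_ml": math.sqrt(var_ml), "se_phi_qml": math.sqrt(var_qml)}


result = qmle_ar1([0.0, 1.0, 0.5, -0.2, 0.8, 0.1, -0.6, 0.3])
print(result)
```

When the Gaussian AR(1) specification is correct the two standard errors should be close in large samples; a persistent gap between them is one informal sign that the information matrix equality fails.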


APPENDIX 14.A. Proof of Chapter 14 Proposition 


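■ Proof of Proposition 14.1. Let $g_i(\theta; \mathcal{Y}_T)$ denote the $i$th element of $g(\theta; \mathcal{Y}_T)$, so that
$g_i: \mathbb{R}^a \to \mathbb{R}^1$. By the mean-value theorem,

$$g_i(\hat{\theta}_T; \mathcal{Y}_T) = g_i(\theta_0; \mathcal{Y}_T) + [d_i(\theta^*_{i,T}; \mathcal{Y}_T)]'(\hat{\theta}_T - \theta_0),$$ [14.A.1]

where

$$\underset{(a \times 1)}{d_i(\theta^*_{i,T}; \mathcal{Y}_T)} \equiv \left. \frac{\partial g_i(\theta; \mathcal{Y}_T)}{\partial \theta} \right|_{\theta = \theta^*_{i,T}}$$

for some $\theta^*_{i,T}$ between $\theta_0$ and $\hat{\theta}_T$; notice that $d_i: \mathbb{R}^a \to \mathbb{R}^a$. Define

$$\underset{(r \times a)}{D_T'} \equiv \begin{bmatrix} [d_1(\theta^*_{1,T}; \mathcal{Y}_T)]' \\ [d_2(\theta^*_{2,T}; \mathcal{Y}_T)]' \\ \vdots \\ [d_r(\theta^*_{r,T}; \mathcal{Y}_T)]' \end{bmatrix}.$$ [14.A.2]

Stacking the equations in [14.A.1] in an $(r \times 1)$ vector produces

$$g(\hat{\theta}_T; \mathcal{Y}_T) = g(\theta_0; \mathcal{Y}_T) + D_T'(\hat{\theta}_T - \theta_0).$$ [14.A.3]

If both sides of [14.A.3] are premultiplied by the $(a \times r)$ matrix

$$\left\{ \left. \frac{\partial g(\theta; \mathcal{Y}_T)}{\partial \theta'} \right|_{\theta = \hat{\theta}_T} \right\}' \times \hat{S}_T^{-1},$$

the result is

$$\left\{ \left. \frac{\partial g(\theta; \mathcal{Y}_T)}{\partial \theta'} \right|_{\theta = \hat{\theta}_T} \right\}' \times \hat{S}_T^{-1} \times [g(\hat{\theta}_T; \mathcal{Y}_T)] = \left\{ \left. \frac{\partial g(\theta; \mathcal{Y}_T)}{\partial \theta'} \right|_{\theta = \hat{\theta}_T} \right\}' \times \hat{S}_T^{-1} \times [g(\theta_0; \mathcal{Y}_T)] + \left\{ \left. \frac{\partial g(\theta; \mathcal{Y}_T)}{\partial \theta'} \right|_{\theta = \hat{\theta}_T} \right\}' \times \hat{S}_T^{-1} \times D_T'(\hat{\theta}_T - \theta_0).$$ [14.A.4]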


³For further discussion, see Gourieroux, Monfort, and Trognon (1984), Gallant and White (1988),
and Wooldridge (1991a, b).




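But equation [14.1.22] implies that the left side of [14.A.4] is zero, so that

$$(\hat{\theta}_T - \theta_0) = -\left[ \left\{ \left. \frac{\partial g(\theta; \mathcal{Y}_T)}{\partial \theta'} \right|_{\theta = \hat{\theta}_T} \right\}' \times \hat{S}_T^{-1} \times D_T' \right]^{-1} \times \left\{ \left. \frac{\partial g(\theta; \mathcal{Y}_T)}{\partial \theta'} \right|_{\theta = \hat{\theta}_T} \right\}' \times \hat{S}_T^{-1} \times [g(\theta_0; \mathcal{Y}_T)].$$ [14.A.5]

Now, $\theta^*_{i,T}$ in [14.A.1] is between $\theta_0$ and $\hat{\theta}_T$, so that $\theta^*_{i,T} \xrightarrow{p} \theta_0$ for each $i$. Thus, condition
(c) ensures that each row of $D_T'$ converges in probability to the corresponding row of $D'$.
Then [14.A.5] implies that

$$\sqrt{T}(\hat{\theta}_T - \theta_0) \xrightarrow{p} -\{D S^{-1} D'\}^{-1} \times \{D S^{-1} \sqrt{T} \cdot g(\theta_0; \mathcal{Y}_T)\}.$$ [14.A.6]

Define

$$C \equiv -\{D S^{-1} D'\}^{-1} \times D S^{-1},$$

so that [14.A.6] becomes

$$\sqrt{T}(\hat{\theta}_T - \theta_0) \xrightarrow{p} C \sqrt{T} \cdot g(\theta_0; \mathcal{Y}_T).$$

Recall from condition (b) of the proposition that

$$\sqrt{T} \cdot g(\theta_0; \mathcal{Y}_T) \xrightarrow{L} N(0, S).$$

It follows as in Example 7.5 of Chapter 7 that

$$\sqrt{T}(\hat{\theta}_T - \theta_0) \xrightarrow{L} N(0, V),$$ [14.A.7]

where

$$V = C S C' = \{D S^{-1} D'\}^{-1} D S^{-1} \times S \times S^{-1} D' \{D S^{-1} D'\}^{-1} = \{D S^{-1} D'\}^{-1},$$

as claimed. ■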


Chapter 14 Exercise 


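14.1. Consider the Gaussian linear regression model,

$$y_t = x_t'\beta + u_t,$$

with $u_t \sim$ i.i.d. $N(0, \sigma^2)$ and $u_t$ independent of $x_\tau$ for all $t$ and $\tau$. Define $\theta \equiv (\beta', \sigma^2)'$. The
log of the likelihood of $(y_1, y_2, \ldots, y_T)$ conditional on $(x_1, x_2, \ldots, x_T)$ is given by

$$\mathcal{L}(\theta) = -(T/2) \log(2\pi) - (T/2) \log(\sigma^2) - \sum_{t=1}^{T} (y_t - x_t'\beta)^2/(2\sigma^2).$$

(a) Show that the estimate $\hat{D}_T'$ in [14.4.12] is given by

$$\hat{D}_T' = \begin{bmatrix} -\dfrac{1}{\hat{\sigma}_T^2} \dfrac{1}{T} \displaystyle\sum_{t=1}^{T} x_t x_t' & 0 \\ 0' & -\dfrac{1}{2\hat{\sigma}_T^4} \end{bmatrix},$$

where $\hat{u}_t \equiv (y_t - x_t'\hat{\beta}_T)$ and $\hat{\beta}_T$ and $\hat{\sigma}_T^2$ denote the maximum likelihood estimates.

(b) Show that the estimate $\hat{S}_T$ in [14.4.13] is given by

$$\hat{S}_T = \begin{bmatrix} \dfrac{1}{T} \displaystyle\sum_{t=1}^{T} \dfrac{\hat{u}_t^2 x_t x_t'}{\hat{\sigma}_T^4} & \dfrac{1}{T} \displaystyle\sum_{t=1}^{T} \left\{ \dfrac{\hat{u}_t^3 x_t}{2\hat{\sigma}_T^6} \right\} \\ \dfrac{1}{T} \displaystyle\sum_{t=1}^{T} \left\{ \dfrac{\hat{u}_t^3 x_t'}{2\hat{\sigma}_T^6} \right\} & \dfrac{1}{T} \displaystyle\sum_{t=1}^{T} \left\{ \dfrac{\hat{u}_t^2}{2\hat{\sigma}_T^4} - \dfrac{1}{2\hat{\sigma}_T^2} \right\}^2 \end{bmatrix}.$$

(c) Show that plim$(\hat{S}_T) = -$plim$(\hat{D}_T) = \mathcal{J}$, where

$$\mathcal{J} = \begin{bmatrix} Q/\sigma^2 & 0 \\ 0' & 1/(2\sigma^4) \end{bmatrix}$$

for $Q \equiv$ plim$(1/T) \sum_{t=1}^{T} x_t x_t'$.

(d) Consider a set of $m$ linear restrictions on $\beta$ of the form $R\beta = r$ for $R$ a known
$(m \times k)$ matrix and $r$ a known $(m \times 1)$ vector. Show that for $\hat{S}_T = -\hat{D}_T$, the Wald test
statistic given in [14.4.16] is identical to the Wald form of the OLS $\chi^2$ test in [8.2.23] with
the OLS estimate of the variance $s_T^2$ in [8.2.23] replaced by the MLE $\hat{\sigma}_T^2$.

(e) Show that when the lower left and upper right blocks of $\hat{S}_T$ are set to their plim
of zero, then the quasi-maximum likelihood Wald test of $R\beta = r$ is identical to the
heteroskedasticity-consistent form of the OLS $\chi^2$ test given in [8.2.36].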


Chapter 14 References 


Aitchison, J., and S. D. Silvey. 1958. "Maximum Likelihood Estimation of Parameters
Subject to Restraints." Annals of Mathematical Statistics 29:813-28.

Amemiya, Takeshi. 1974. "The Nonlinear Two-Stage Least-Squares Estimator." Journal of
Econometrics 2:105-10.

Andrews, Donald W. K. 1991. "Heteroskedasticity and Autocorrelation Consistent
Covariance Matrix Estimation." Econometrica 59:817-58.

Andrews, Donald W. K. 1993. "Tests for Parameter Instability and Structural Change with
Unknown Change Point." Econometrica 61:821-56.

Andrews, Donald W. K., and Ray C. Fair. 1988. "Inference in Nonlinear Econometric Models
with Structural Change." Review of Economic Studies 55:615-40.

Andrews, Donald W. K., and J. Christopher Monahan. 1992. "An Improved Heteroskedasticity
and Autocorrelation Consistent Covariance Matrix Estimator." Econometrica 60:953-66.

Bates, Charles, and Halbert White. 1988. "Efficient Instrumental Variables Estimation of
Systems of Implicit Heterogeneous Nonlinear Dynamic Equations with Nonspherical
Errors," in William A. Barnett, Ernst R. Berndt, and Halbert White, eds., Dynamic
Econometric Modeling. Cambridge, England: Cambridge University Press.

Breusch, T. S., and A. R. Pagan. 1980. "The Lagrange Multiplier Test and Its Applications
to Model Specification in Econometrics." Review of Economic Studies 47:239-53.

Cramér, H. 1946. Mathematical Methods of Statistics. Princeton, N.J.: Princeton University
Press.

Engle, Robert F. 1984. "Wald, Likelihood Ratio, and Lagrange Multiplier Tests in
Econometrics," in Zvi Griliches and Michael D. Intriligator, eds., Handbook of Econometrics,
Vol. 2. Amsterdam: North-Holland.

Ferguson, T. S. 1958. "A Method of Generating Best Asymptotically Normal Estimates
with Application to the Estimation of Bacterial Densities." Annals of Mathematical Statistics
29:1046-62.

Gallant, A. Ronald. 1977. "Three-Stage Least-Squares Estimation for a System of
Simultaneous, Nonlinear, Implicit Equations." Journal of Econometrics 5:71-88.

Gallant, A. Ronald. 1987. Nonlinear Statistical Models. New York: Wiley.

Gallant, A. Ronald, and George Tauchen. 1992. "Which Moments to Match?" Duke University. Mimeo.

Gallant, A. Ronald, and Halbert White. 1988. A Unified Theory of Estimation and Inference for
Nonlinear Dynamic Models. Oxford: Blackwell.

Garber, Peter M., and Robert G. King. 1984. "Deep Structural Excavation? A Critique of
Euler Equation Methods." University of Rochester. Mimeo.

Ghysels, Eric, and Alastair Hall. 1990a. "A Test for Structural Stability of Euler Conditions
Parameters Estimated via the Generalized Method of Moments Estimator." International
Economic Review 31:355-64.

Ghysels, Eric, and Alastair Hall. 1990b. "Are Consumption-Based Intertemporal Capital Asset
Pricing Models Structural?" Journal of Econometrics 45:121-39.

Godfrey, L. G. 1988. Misspecification Tests in Econometrics: The Lagrange Multiplier
Principle and Other Approaches. Cambridge, England: Cambridge University Press.

Gourieroux, C., A. Monfort, and A. Trognon. 1984. "Pseudo Maximum Likelihood Methods:
Theory." Econometrica 52:681-700.

Hall, Alastair. 1993. "Some Aspects of Generalized Method of Moments Estimation," in
C. R. Rao, G. S. Maddala, and H. D. Vinod, eds., Handbook of Statistics, Vol. 11,
Econometrics. Amsterdam: North-Holland.




Hansen, Lars P. 1982. "Large Sample Properties of Generalized Method of Moments
Estimators." Econometrica 50:1029-54.

Hansen, Lars P., and Kenneth J. Singleton. 1982. "Generalized Instrumental Variables
Estimation of Nonlinear Rational Expectations Models." Econometrica 50:1269-86. Errata:
Econometrica 52:267-68.

Jorgenson, D. W., and J. Laffont. 1974. "Efficient Estimation of Nonlinear Simultaneous
Equations with Additive Disturbances." Annals of Economic and Social Measurement
3:615-40.

Kocherlakota, Narayana R. 1990. "On Tests of Representative Consumer Asset Pricing
Models." Journal of Monetary Economics 26:285-304.

Malinvaud, E. 1970. Statistical Methods of Econometrics. Amsterdam: North-Holland.

Nelson, Charles R., and Richard Startz. 1990. "Some Further Results on the Exact Small
Sample Properties of the Instrumental Variable Estimator." Econometrica 58:967-76.

Newey, Whitney K. 1985. "Generalized Method of Moments Specification Testing." Journal
of Econometrics 29:229-56.

Newey, Whitney K., and Kenneth D. West. 1987. "A Simple Positive Semi-Definite,
Heteroskedasticity and Autocorrelation Consistent Covariance Matrix." Econometrica 55:703-8.

Ogaki, Masao. 1993. "Generalized Method of Moments: Econometric Applications," in
G. S. Maddala, C. R. Rao, and H. D. Vinod, eds., Handbook of Statistics, Vol. 11,
Econometrics. Amsterdam: North-Holland.

Pearson, Karl. 1894. "Contribution to the Mathematical Theory of Evolution." Philosophical
Transactions of the Royal Society of London, Series A 185:71-110.

Rao, C. R. 1948. "Large Sample Tests of Statistical Hypotheses Concerning Several
Parameters with Application to Problems of Estimation." Proceedings of the Cambridge
Philosophical Society 44:50-57.

Rothenberg, Thomas J. 1973. Efficient Estimation with A Priori Information. New Haven,
Conn.: Yale University Press.

Sargent, Thomas J. 1987. Dynamic Macroeconomic Theory. Cambridge, Mass.: Harvard
University Press.

Sill, Keith. 1992. Money in the Cash-in-Advance Model: An Empirical Implementation.
Unpublished Ph.D. dissertation, University of Virginia.

Tauchen, George. 1986. "Statistical Properties of Generalized Method-of-Moments
Estimators of Structural Parameters Obtained from Financial Market Data." Journal of Business
and Economic Statistics 4:397-416.

White, Halbert. 1980. "A Heteroskedasticity-Consistent Covariance Matrix Estimator and
a Direct Test for Heteroskedasticity." Econometrica 48:817-38.

White, Halbert. 1982. "Maximum Likelihood Estimation of Misspecified Models." Econometrica
50:1-25.

White, Halbert. 1987. "Specification Testing in Dynamic Models," in Truman F. Bewley, ed.,
Advances in Econometrics, Fifth World Congress, Vol. 11. Cambridge, England: Cambridge
University Press.

Wooldridge, Jeffrey M. 1991a. "On the Application of Robust, Regression-Based
Diagnostics to Models of Conditional Means and Conditional Variances." Journal of
Econometrics 47:5-46.

Wooldridge, Jeffrey M. 1991b. "Specification Testing and Quasi-Maximum Likelihood
Estimation." Journal of Econometrics 48:29-55.




15
Models of Nonstationary Time Series


Up to this point our analysis has typically been confined to stationary processes. 
This chapter introduces several approaches to modeling nonstationary time series 
and analyzes the dynamic properties of different models of nonstationarity. Con- 
sequences of nonstationarity for statistical inference are investigated in subsequent 
chapters. 


15.1. Introduction 


Chapters 3 and 4 discussed univariate time series models that can be written in the
form

$$y_t = \mu + \varepsilon_t + \psi_1 \varepsilon_{t-1} + \psi_2 \varepsilon_{t-2} + \cdots = \mu + \psi(L)\varepsilon_t,$$ [15.1.1]

where $\sum_{j=0}^{\infty} |\psi_j| < \infty$, roots of $\psi(z) = 0$ are outside the unit circle, and $\{\varepsilon_t\}$ is a
white noise sequence with mean zero and variance $\sigma^2$. Two features of such
processes merit repeating here. First, the unconditional expectation of the variable is
a constant, independent of the date of the observation:

$$E(y_t) = \mu.$$

Second, as one tries to forecast the series farther into the future, the forecast
$\hat{y}_{t+s|t} \equiv \hat{E}(y_{t+s} \mid y_t, y_{t-1}, \ldots)$ converges to the unconditional mean:

$$\lim_{s \to \infty} \hat{y}_{t+s|t} = \mu.$$

These can be quite unappealing assumptions for many of the economic and 
financial time series encountered in practice. For example, Figure 15.1 plots the 
level of nominal gross national product for the United States since World War II. 
There is no doubt that this series has trended upward over time, and this upward 
trend should be incorporated in any forecasts of this series. 

FIGURE 15.1 U.S. nominal GNP, 1947-87.

There are two popular approaches to describing such trends. The first is to
include a deterministic time trend:

$$y_t = \alpha + \delta t + \psi(L)\varepsilon_t.$$ [15.1.2]

Thus, the mean $\mu$ of the stationary¹ process [15.1.1] is replaced by a linear function
of the date $t$. Such a process is sometimes described as trend-stationary, because if
one subtracts the trend $\alpha + \delta t$ from [15.1.2], the result is a stationary process.

¹Recall that "stationary" is taken to mean "covariance-stationary."

The second specification is a unit root process,

$$(1 - L)y_t = \delta + \psi(L)\varepsilon_t,$$ [15.1.3]

where $\psi(1) \neq 0$. For a unit root process, a stationary representation of the form
of [15.1.1] describes changes in the series. For reasons that will become clear
shortly, the mean of $(1 - L)y_t$ is denoted $\delta$ rather than $\mu$.

The first-difference operator $(1 - L)$ will come up sufficiently often that a
special symbol (the Greek letter $\Delta$) is reserved for it:

$$\Delta y_t \equiv y_t - y_{t-1}.$$

The prototypical example of a unit root process is obtained by setting $\psi(L)$
equal to 1 in [15.1.3]:

$$y_t = y_{t-1} + \delta + \varepsilon_t.$$ [15.1.4]

This process is known as a random walk with drift $\delta$.

In the definition of the unit root process in [15.1.3], it was assumed that $\psi(1)$
is nonzero, where $\psi(1)$ denotes the polynomial

$$\psi(z) = 1 + \psi_1 z + \psi_2 z^2 + \cdots$$

evaluated at $z = 1$. To see why such a restriction must be part of the definition
of a unit root process, suppose that the original series $y_t$ is in fact stationary with
a representation of the form

$$y_t = \mu + \chi(L)\varepsilon_t.$$

If such a stationary series is differenced, the result is

$$(1 - L)y_t = (1 - L)\chi(L)\varepsilon_t \equiv \psi(L)\varepsilon_t,$$

where $\psi(L) \equiv (1 - L)\chi(L)$. This representation is in the form of [15.1.3]: if the
original series $y_t$ is stationary, then so is $\Delta y_t$. However, the moving average operator
$\psi(L)$ that characterizes $\Delta y_t$ has the property that $\psi(1) = (1 - 1) \cdot \chi(1) = 0$. When
we stipulated that $\psi(1) \neq 0$ in [15.1.3], we were thus ruling out the possibility
that the original series $y_t$ is stationary.
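The claim that differencing a stationary series forces $\psi(1) = 0$ can be checked numerically. The following is a minimal sketch for a finite-order $\chi(L)$; the helper name `difference_ma` and the example coefficients are illustrative assumptions.

```python
def difference_ma(chi):
    """Coefficients of psi(L) = (1 - L) * chi(L), given chi = [chi_0, chi_1, ...]."""
    padded = list(chi) + [0.0]
    # psi_j = chi_j - chi_{j-1}, with chi_{-1} taken to be zero.
    return [padded[j] - (padded[j - 1] if j > 0 else 0.0)
            for j in range(len(padded))]


chi = [1.0, 0.5, 0.25]        # a stationary MA(2) in levels
psi = difference_ma(chi)      # MA coefficients of the differenced series
print(psi)                    # [1.0, -0.5, -0.25, -0.25]
print(sum(psi))               # psi(1) = 0.0
```

Because the coefficients of $(1 - L)\chi(L)$ telescope, their sum is zero whatever finite $\chi(L)$ is supplied.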

It is sometimes convenient to work with a slightly different representation of
the unit root process [15.1.3]. Consider the following specification:

$$y_t = \alpha + \delta t + u_t,$$ [15.1.5]

where $u_t$ follows a zero-mean ARMA process:

$$(1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p)u_t = (1 + \theta_1 L + \theta_2 L^2 + \cdots + \theta_q L^q)\varepsilon_t$$ [15.1.6]

and where the moving average operator $(1 + \theta_1 L + \theta_2 L^2 + \cdots + \theta_q L^q)$ is
invertible. Suppose that the autoregressive operator in [15.1.6] is factored as in
equation [2.4.3]:

$$(1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p) = (1 - \lambda_1 L)(1 - \lambda_2 L) \cdots (1 - \lambda_p L).$$

If all of the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_p$ are inside the unit circle, then [15.1.6] can
be expressed as

$$u_t = \frac{1 + \theta_1 L + \theta_2 L^2 + \cdots + \theta_q L^q}{(1 - \lambda_1 L)(1 - \lambda_2 L) \cdots (1 - \lambda_p L)} \varepsilon_t \equiv \psi(L)\varepsilon_t,$$

with $\sum_{j=0}^{\infty} |\psi_j| < \infty$ and roots of $\psi(z) = 0$ outside the unit circle. Thus, when
$|\lambda_i| < 1$ for all $i$, the process [15.1.5] would just be a special case of the
trend-stationary process of [15.1.2].

Suppose instead that $\lambda_1 = 1$ and $|\lambda_i| < 1$ for $i = 2, 3, \ldots, p$. Then [15.1.6]
would state that

$$(1 - L)(1 - \lambda_2 L)(1 - \lambda_3 L) \cdots (1 - \lambda_p L)u_t = (1 + \theta_1 L + \theta_2 L^2 + \cdots + \theta_q L^q)\varepsilon_t,$$ [15.1.7]

implying that

$$(1 - L)u_t = \frac{1 + \theta_1 L + \theta_2 L^2 + \cdots + \theta_q L^q}{(1 - \lambda_2 L)(1 - \lambda_3 L) \cdots (1 - \lambda_p L)} \varepsilon_t \equiv \psi^*(L)\varepsilon_t,$$

with $\sum_{j=0}^{\infty} |\psi_j^*| < \infty$ and roots of $\psi^*(z) = 0$ outside the unit circle. Thus, if [15.1.5]
is first-differenced, the result is

$$(1 - L)y_t = (1 - L)\alpha + [\delta t - \delta(t - 1)] + (1 - L)u_t = 0 + \delta + \psi^*(L)\varepsilon_t,$$

which is of the form of the unit root process [15.1.3].

The representation in [15.1.5] explains the use of the term "unit root process."
One of the roots or eigenvalues ($\lambda_1$) of the autoregressive polynomial in [15.1.6]
is unity, and all other eigenvalues are inside the unit circle.

Another expression that is sometimes used is that the process [15.1.3] is
integrated of order 1. This is indicated as $y_t \sim I(1)$. The term "integrated" comes
from calculus; if $dy/dt = x$, then $y$ is the integral of $x$. In discrete time series, if
$\Delta y_t = x_t$, then $y$ might also be viewed as the integral, or sum over $t$, of $x$.

If a process written in the form of [15.1.5] and [15.1.6] has two eigenvalues
$\lambda_1$ and $\lambda_2$ that are both equal to unity with the others all inside the unit circle,
then second differences of the data have to be taken before arriving at a stationary
time series:

$$(1 - L)^2 y_t = \kappa + \psi(L)\varepsilon_t.$$

Such a process is said to be integrated of order 2, denoted $y_t \sim I(2)$.

A general process written in the form of [15.1.5] and [15.1.6] is called an
autoregressive integrated moving average process, denoted ARIMA($p$, $d$, $q$). The
first parameter ($p$) refers to the number of autoregressive lags (not counting the
unit roots), the second parameter ($d$) refers to the order of integration, and the
third parameter ($q$) gives the number of moving average lags. Taking $d$th differences
of an ARIMA($p$, $d$, $q$) produces a stationary ARMA($p$, $q$) process.
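The relation between integration and differencing can be illustrated with a small sketch: summing a series twice produces an I(2)-style series, and taking second differences recovers the original values. The helper names `cumulate` and `difference` are hypothetical, and the input numbers are arbitrary.

```python
def cumulate(x):
    """Running sum of x: the discrete analogue of integration."""
    total, out = 0, []
    for value in x:
        total += value
        out.append(total)
    return out


def difference(y, d=1):
    """Apply the first-difference operator (1 - L) to y a total of d times."""
    out = list(y)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


x = [3, -1, 4, 1, -5, 9]          # stand-in for a stationary series
y = cumulate(cumulate(x))         # "integrated of order 2"
print(difference(y, 2))           # [4, 1, -5, 9], i.e. x[2:]
```

Each round of differencing loses one observation, which is why the output starts at the third element of `x`.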




15.2. Why Linear Time Trends and Unit Roots? 


One might wonder why, for the trend-stationary specification [15.1.2], the trend
is specified to be a linear function of time ($\delta t$) rather than a quadratic function
($\delta t + \gamma t^2$) or exponential ($e^{\delta t}$). Indeed, the GNP series in Figure 15.1, like many
economic and financial time series, seems better characterized by an exponential
trend than a linear trend. An exponential trend exhibits constant proportional
growth; that is, if

$$y_t = e^{\delta t},$$ [15.2.1]

then $dy/dt = \delta \cdot y_t$. Proportional growth in the population would arise if the number
of children born were a constant fraction of the current population. Proportional
growth in prices (or constant inflation) would arise if the government were trying
to collect a constant level of real revenues from printing money. Such stories are
often an appealing starting point for thinking about the sources of time trends, and
exponential growth is often confirmed by the visual appearance of the series as in
Figure 15.1. For this reason, many economists simply assume that growth is of the
exponential form.

Notice that if we take the natural log of the exponential trend [15.2.1], the
result is a linear trend,

$$\log(y_t) = \delta t.$$

Thus, it is common to take logs of the data before attempting to describe them
with the model in [15.1.2].

Similar arguments suggest taking natural logs before applying [15.1.3]. For
small changes, the first difference of the log of a variable is approximately the same
as the percentage change in the variable:

$$(1 - L) \log(y_t) = \log(y_t/y_{t-1}) = \log\{1 + [(y_t - y_{t-1})/y_{t-1}]\} \cong (y_t - y_{t-1})/y_{t-1},$$

where we have used the fact that for $x$ close to zero, $\log(1 + x) \cong x$.² Thus, if the
logs of a variable are specified to follow a unit root process, the assumption is that
the rate of growth of the series is a stationary stochastic process. The same
arguments used to justify taking logs before applying [15.1.2] also suggest taking logs
before applying [15.1.3].

Often the units are slightly more convenient if $\log(y_t)$ is multiplied by 100.
Then changes are measured directly in units of percentage change. For example,
if $(1 - L)[100 \times \log(y_t)] = 1.0$, then $y_t$ is 1% higher than $y_{t-1}$.
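How close the log difference is to the percentage change depends on the size of the change. The sketch below compares the two measures; the function names and the example values are illustrative assumptions.

```python
import math


def log_growth_100(y_prev, y_cur):
    """(1 - L)[100 * log(y_t)]: change in 100 times the natural log."""
    return 100.0 * (math.log(y_cur) - math.log(y_prev))


def pct_change(y_prev, y_cur):
    """Exact percentage change from y_prev to y_cur."""
    return 100.0 * (y_cur - y_prev) / y_prev


for y_prev, y_cur in [(100.0, 101.0), (100.0, 110.0), (100.0, 150.0)]:
    print(y_prev, y_cur,
          round(pct_change(y_prev, y_cur), 3),
          round(log_growth_100(y_prev, y_cur), 3))
```

For a 1% increase the two measures agree to about two decimal places, while for a 50% increase the log difference is only about 40.5, so the approximation should be reserved for small changes.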


15.3. Comparison of Trend-Stationary 
and Unit Root Processes 


This section compares a trend-stationary process [15.1.2] with a unit root process 
[15.1.3] in terms of forecasts of the series, variance of the forecast error, dynamic 
multipliers, and transformations needed to achieve stationarity. 


²See result [A.3.36] in the Mathematical Review (Appendix A) at the end of the book.




Comparison of Forecasts 


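To forecast a trend-stationary process [15.1.2], the known deterministic
component $(\alpha + \delta t)$ is simply added to the forecast of the stationary stochastic
component:

$$\hat{y}_{t+s|t} = \alpha + \delta(t + s) + \psi_s \varepsilon_t + \psi_{s+1} \varepsilon_{t-1} + \psi_{s+2} \varepsilon_{t-2} + \cdots.$$ [15.3.1]

Here $\hat{y}_{t+s|t}$ denotes the linear projection of $y_{t+s}$ on a constant and $y_t, y_{t-1}, \ldots$.
Note that for nonstationary processes, we will follow the convention that the
"constant" term in a linear projection, in this case $\alpha + \delta(t + s)$, can be different for
each date $t + s$. As the forecast horizon ($s$) grows large, absolute summability of
$\{\psi_j\}$ implies that this forecast converges in mean square to the time trend:

$$E[\hat{y}_{t+s|t} - \alpha - \delta(t + s)]^2 \to 0 \quad \text{as } s \to \infty.$$

To forecast the unit root process [15.1.3], recall that the change $\Delta y_t$ is a
stationary process that can be forecast using the standard formula:

$$\Delta\hat{y}_{t+s|t} \equiv \hat{E}[(y_{t+s} - y_{t+s-1}) \mid y_t, y_{t-1}, \ldots] = \delta + \psi_s \varepsilon_t + \psi_{s+1} \varepsilon_{t-1} + \psi_{s+2} \varepsilon_{t-2} + \cdots.$$ [15.3.2]

The level of the variable at date $t + s$ is simply the sum of the changes between $t$
and $t + s$:

$$y_{t+s} = (y_{t+s} - y_{t+s-1}) + (y_{t+s-1} - y_{t+s-2}) + \cdots + (y_{t+1} - y_t) + y_t = \Delta y_{t+s} + \Delta y_{t+s-1} + \cdots + \Delta y_{t+1} + y_t.$$ [15.3.3]

Taking the linear projection of [15.3.3] on a constant and $y_t, y_{t-1}, \ldots$ and
substituting from [15.3.2] gives

$$\begin{aligned} \hat{y}_{t+s|t} &= \Delta\hat{y}_{t+s|t} + \Delta\hat{y}_{t+s-1|t} + \cdots + \Delta\hat{y}_{t+1|t} + y_t \\ &= \{\delta + \psi_s \varepsilon_t + \psi_{s+1} \varepsilon_{t-1} + \psi_{s+2} \varepsilon_{t-2} + \cdots\} \\ &\quad + \{\delta + \psi_{s-1} \varepsilon_t + \psi_s \varepsilon_{t-1} + \psi_{s+1} \varepsilon_{t-2} + \cdots\} \\ &\quad + \cdots + \{\delta + \psi_1 \varepsilon_t + \psi_2 \varepsilon_{t-1} + \psi_3 \varepsilon_{t-2} + \cdots\} + y_t \end{aligned}$$

or

$$\hat{y}_{t+s|t} = s\delta + y_t + (\psi_s + \psi_{s-1} + \cdots + \psi_1)\varepsilon_t + (\psi_{s+1} + \psi_s + \cdots + \psi_2)\varepsilon_{t-1} + \cdots.$$ [15.3.4]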


Further insight into the forecast of a unit root process is obtained by analyzing
some special cases. Consider first the random walk with drift [15.1.4], in which
$\psi_1 = \psi_2 = \cdots = 0$. Then [15.3.4] becomes

$$\hat{y}_{t+s|t} = s\delta + y_t.$$

A random walk with drift $\delta$ is expected to grow at the constant rate of $\delta$ per period
from whatever its current value $y_t$ happens to be.

Consider next an ARIMA(0, 1, 1) specification ($\psi_1 = \theta$, $\psi_2 = \psi_3 = \cdots = 0$). Then

$$\hat{y}_{t+s|t} = s\delta + y_t + \theta\varepsilon_t.$$ [15.3.5]

Here, the current level of the series $y_t$ along with the current innovation $\varepsilon_t$ again
defines a base from which the variable is expected to grow at the constant rate $\delta$.
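The general forecast [15.3.4] and its two special cases can be written as a short function. This is a minimal sketch assuming a finite list of moving average weights (later weights treated as zero) and a known history of innovations; the function and argument names are hypothetical.

```python
def forecast_unit_root(y_t, delta, psi, eps_hist, s):
    """s-period-ahead forecast of a unit root process, following [15.3.4].

    psi      -- [psi_1, psi_2, ...]; weights beyond the list are taken as zero
    eps_hist -- [eps_t, eps_{t-1}, ...]; older innovations are taken as zero
    """
    def psi_at(j):
        return psi[j - 1] if 1 <= j <= len(psi) else 0.0

    forecast = s * delta + y_t
    for k, eps in enumerate(eps_hist):
        # Coefficient on eps_{t-k} is psi_{k+1} + psi_{k+2} + ... + psi_{k+s}.
        forecast += sum(psi_at(j) for j in range(k + 1, k + s + 1)) * eps
    return forecast


# Random walk with drift: the forecast is s*delta + y_t.
print(forecast_unit_root(10.0, 0.5, [], [2.0, -1.0], 3))      # 11.5
# ARIMA(0, 1, 1) with theta = 0.4: the forecast adds theta * eps_t.
print(forecast_unit_root(10.0, 0.5, [0.4], [2.0, -1.0], 3))   # 12.3
```

In the ARIMA(0, 1, 1) case only the current innovation matters, because every coefficient on older innovations involves weights $\psi_2, \psi_3, \ldots$ that are zero.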




Notice that $\varepsilon_t$ is the one-period-ahead forecast error:

$$\varepsilon_t = y_t - \hat{y}_{t|t-1}.$$

It follows from [15.3.5] that for $\delta = 0$ and $s = 1$,

$$\hat{y}_{t+1|t} = y_t + \theta(y_t - \hat{y}_{t|t-1})$$ [15.3.6]

or

$$\hat{y}_{t+1|t} = (1 + \theta)y_t - \theta\hat{y}_{t|t-1}.$$ [15.3.7]

Equation [15.3.7] takes the form of a simple first-order difference equation, relating
$\hat{y}_{t+1|t}$ to its own lagged value and to the input variable $(1 + \theta)y_t$. Provided that
$|\theta| < 1$, expression [15.3.7] can be written using result [2.2.9] as

$$\begin{aligned} \hat{y}_{t+1|t} &= [(1 + \theta)y_t] + (-\theta)[(1 + \theta)y_{t-1}] + (-\theta)^2[(1 + \theta)y_{t-2}] + (-\theta)^3[(1 + \theta)y_{t-3}] + \cdots \\ &= (1 + \theta) \sum_{j=0}^{\infty} (-\theta)^j y_{t-j}. \end{aligned}$$ [15.3.8]

Expression [15.3.7] is sometimes described as adaptive expectations, and its
implication [15.3.8] is referred to as exponential smoothing; typical applications assume
that $-1 < \theta < 0$. Letting $y_t$ denote income, Friedman (1957) used exponential
smoothing to construct one of his measures of permanent income. Muth (1960)
noted that adaptive expectations or exponential smoothing corresponds to a rational
forecast of future income only if $y_t$ follows an ARIMA(0, 1, 1) process and the
smoothing weight ($-\theta$) is chosen to equal the negative of the moving average
coefficient of the differenced data ($\theta$).
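The equivalence between the recursion [15.3.7] and the weighted sum [15.3.8] can be verified on a finite sample. The sketch below starts the recursion from an initial forecast of zero, which is an assumption of this example; with that starting value the recursion reproduces the sum truncated at the first observation. The function names are hypothetical.

```python
def adaptive_expectations(y, theta, initial_forecast=0.0):
    """Iterate [15.3.7] over y = [y_1, ..., y_T]; returns the forecast of y_{T+1}."""
    forecast = initial_forecast                  # plays the role of yhat_{1|0}
    for obs in y:
        forecast = (1 + theta) * obs - theta * forecast
    return forecast


def exponential_smoothing(y, theta):
    """Truncated version of [15.3.8]: (1 + theta) * sum_j (-theta)^j * y_{T-j}."""
    return (1 + theta) * sum((-theta) ** j * obs
                             for j, obs in enumerate(reversed(y)))


y = [1.0, 2.0, 1.5, 3.0]
theta = -0.4                                     # typical case: -1 < theta < 0
print(adaptive_expectations(y, theta), exponential_smoothing(y, theta))
```

With $\theta = -0.4$ the weights $(1 + \theta)(-\theta)^j = 0.6 \times 0.4^j$ are positive and decline geometrically, which is the sense in which the forecast is an exponentially smoothed average of past observations.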

For an ARIMA(0, 1, $q$) process, the value of $y_t$ and the $q$ most recent values
of $\varepsilon_t$ influence the forecasts $\hat{y}_{t+1|t}, \hat{y}_{t+2|t}, \ldots, \hat{y}_{t+q|t}$, but thereafter the series is
expected to grow at the rate $\delta$. For an ARIMA($p$, 1, $q$), the forecast growth rate
asymptotically approaches $\delta$.

Thus, the parameter $\delta$ in the unit root process [15.1.3] plays a similar role to
that of $\delta$ in the deterministic time trend [15.1.2]. With either specification, the
forecast $\hat{y}_{t+s|t}$ in [15.3.1] or [15.3.4] converges to a linear function of the forecast
horizon $s$ with slope $\delta$; see Figure 15.2. The key difference is in the intercept of
the line. For a trend-stationary process, the forecast converges to a line whose
intercept is the same regardless of the value of $y_t$. By contrast, the intercept of the
limiting forecast for a unit root process is continually changing with each new
observation on $y$.


Comparison of Forecast Errors 


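The trend-stationary and unit root specifications are also very different in
their implications for the variance of the forecast error. For the trend-stationary
process [15.1.2], the $s$-period-ahead forecast error is

$$\begin{aligned} y_{t+s} - \hat{y}_{t+s|t} &= \{\alpha + \delta(t + s) + \varepsilon_{t+s} + \psi_1 \varepsilon_{t+s-1} + \psi_2 \varepsilon_{t+s-2} + \cdots + \psi_{s-1} \varepsilon_{t+1} + \psi_s \varepsilon_t + \psi_{s+1} \varepsilon_{t-1} + \cdots\} \\ &\quad - \{\alpha + \delta(t + s) + \psi_s \varepsilon_t + \psi_{s+1} \varepsilon_{t-1} + \psi_{s+2} \varepsilon_{t-2} + \cdots\} \\ &= \varepsilon_{t+s} + \psi_1 \varepsilon_{t+s-1} + \psi_2 \varepsilon_{t+s-2} + \cdots + \psi_{s-1} \varepsilon_{t+1}. \end{aligned}$$

The mean squared error (MSE) of this forecast is

$$E[y_{t+s} - \hat{y}_{t+s|t}]^2 = \{1 + \psi_1^2 + \psi_2^2 + \cdots + \psi_{s-1}^2\}\sigma^2.$$

The MSE increases with the forecasting horizon $s$, though as $s$ becomes large, the
added uncertainty from forecasting farther into the future becomes negligible:

$$\lim_{s \to \infty} E[y_{t+s} - \hat{y}_{t+s|t}]^2 = \{1 + \psi_1^2 + \psi_2^2 + \cdots\}\sigma^2.$$

Note that the limiting MSE is just the unconditional variance of the stationary
component $\psi(L)\varepsilon_t$.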

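By contrast, for the unit root process [15.1.3], the $s$-period-ahead forecast
error is

$$\begin{aligned} y_{t+s} - \hat{y}_{t+s|t} &= \{\Delta y_{t+s} + \Delta y_{t+s-1} + \cdots + \Delta y_{t+1} + y_t\} - \{\Delta\hat{y}_{t+s|t} + \Delta\hat{y}_{t+s-1|t} + \cdots + \Delta\hat{y}_{t+1|t} + y_t\} \\ &= \{\varepsilon_{t+s} + \psi_1 \varepsilon_{t+s-1} + \cdots + \psi_{s-1} \varepsilon_{t+1}\} + \{\varepsilon_{t+s-1} + \psi_1 \varepsilon_{t+s-2} + \cdots + \psi_{s-2} \varepsilon_{t+1}\} + \cdots + \{\varepsilon_{t+1}\} \\ &= \varepsilon_{t+s} + \{1 + \psi_1\}\varepsilon_{t+s-1} + \{1 + \psi_1 + \psi_2\}\varepsilon_{t+s-2} + \cdots + \{1 + \psi_1 + \psi_2 + \cdots + \psi_{s-1}\}\varepsilon_{t+1}, \end{aligned}$$

with MSE

$$E[y_{t+s} - \hat{y}_{t+s|t}]^2 = \{1 + (1 + \psi_1)^2 + (1 + \psi_1 + \psi_2)^2 + \cdots + (1 + \psi_1 + \psi_2 + \cdots + \psi_{s-1})^2\}\sigma^2.$$

FIGURE 15.2 Forecasts and 95% confidence intervals. (a) Trend-stationary process. (b) Unit root process. (Horizontal axis: time.)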

The MSE again increases with the length of the forecasting horizon $s$, though in
contrast to the trend-stationary case, the MSE does not converge to any fixed value
as $s$ goes to infinity. Instead, it asymptotically approaches a linear function of $s$
with slope $(1 + \psi_1 + \psi_2 + \cdots)^2\sigma^2$. For example, for an ARIMA(0, 1, 1) process,

$$E[y_{t+s} - \hat{y}_{t+s|t}]^2 = \{1 + (s - 1)(1 + \theta)^2\}\sigma^2.$$ [15.3.9]

To summarize, for a trend-stationary process the MSE reaches a finite bound
as the forecast horizon becomes large, whereas for a unit root process the MSE
eventually grows linearly with the forecast horizon. This result is again illustrated
in Figure 15.2.
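The two MSE formulas can be compared directly. The sketch below evaluates both for a finite list of weights $\psi_1, \psi_2, \ldots$ (later weights treated as zero); the function names and parameter values are assumptions chosen for illustration.

```python
def mse_trend_stationary(psi, s, sigma2=1.0):
    """{1 + psi_1^2 + ... + psi_{s-1}^2} * sigma^2."""
    weights = [1.0] + list(psi)                       # psi_0 = 1
    return sigma2 * sum(weights[j] ** 2
                        for j in range(s) if j < len(weights))


def mse_unit_root(psi, s, sigma2=1.0):
    """{1 + (1 + psi_1)^2 + ... + (1 + psi_1 + ... + psi_{s-1})^2} * sigma^2."""
    weights = [1.0] + list(psi)
    running, total = 0.0, 0.0
    for j in range(s):
        running += weights[j] if j < len(weights) else 0.0
        total += running ** 2
    return sigma2 * total


theta, sigma2 = 0.5, 2.0
for s in (1, 2, 4, 8):
    print(s, mse_trend_stationary([theta], s, sigma2),
          mse_unit_root([theta], s, sigma2))
```

With $\psi_1 = \theta = 0.5$ and $\sigma^2 = 2$, the trend-stationary MSE settles at 2.5 from $s = 2$ onward, while the unit root MSE follows [15.3.9] and reaches $\{1 + 3(1.5)^2\} \times 2 = 15.5$ at $s = 4$.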

Note that since the MSE grows linearly with the forecast horizon $s$, the
standard deviation of the forecast error grows with the square root of $s$. On the
other hand, if $\delta > 0$, then the forecast itself grows linearly in $s$. Thus, a 95%
confidence interval for $y_{t+s}$ expands more slowly than the level of the series,
meaning that data from a unit root process with positive drift are certain to exhibit
an upward trend if observed for a sufficiently long period. In this sense the trend
introduced by a nonzero drift $\delta$ asymptotically dominates the increasing variability
arising over time due to the unit root component. This result is very important for
understanding the asymptotic statistical results to be presented in Chapters 17 and
18.

Figure 15.3 plots realizations of a Gaussian random walk without drift and 
with drift. The random walk without drift, shown in panel (a), shows no tendency 
to return to its starting value or any unconditional mean. The random walk with 
drift, shown in panel (b), shows no tendency to return to a fixed deterministic 
trend line, though the series is asymptotically dominated by the positive drift term. 


Comparison of Dynamic Multipliers 


Another difference between trend-stationary and unit root processes is the
persistence of innovations. Consider the consequences for $y_{t+s}$ if $\varepsilon_t$ were to increase
by one unit with $\varepsilon$'s for all other dates unaffected. For the trend-stationary process
[15.1.2], this dynamic multiplier is given by

$$\frac{\partial y_{t+s}}{\partial \varepsilon_t} = \psi_s.$$

For a trend-stationary process, then, the effect of any stochastic disturbance
eventually wears off:

$$\lim_{s \to \infty} \frac{\partial y_{t+s}}{\partial \varepsilon_t} = 0.$$

FIGURE 15.3 Sample realizations from Gaussian unit root processes. (a) Random walk without drift. (b) Random walk with drift.

By contrast, for a unit root process, the effect of $\varepsilon_t$ on $y_{t+s}$ is seen from [15.3.4]
to be³

$$\frac{\partial y_{t+s}}{\partial \varepsilon_t} = \frac{\partial y_t}{\partial \varepsilon_t} + \psi_s + \psi_{s-1} + \cdots + \psi_1 = 1 + \psi_1 + \psi_2 + \cdots + \psi_s.$$

An innovation $\varepsilon_t$ has a permanent effect on the level of $y$ that is captured by

$$\lim_{s \to \infty} \frac{\partial y_{t+s}}{\partial \varepsilon_t} = 1 + \psi_1 + \psi_2 + \cdots = \psi(1).$$ [15.3.10]

³This, of course, contrasts with the multiplier that describes the effect of $\varepsilon_t$ on the change between
$y_{t+s}$ and $y_{t+s-1}$, which is given by

$$\frac{\partial \Delta y_{t+s}}{\partial \varepsilon_t} = \psi_s.$$


As an illustration of calculating such a multiplier, the following ARIMA(4,
1, 0) model was estimated for $y_t$ equal to 100 times the log of quarterly U.S. real
GNP ($t$ = 1952:II to 1984:IV):

$$\Delta y_t = 0.555 + 0.312\,\Delta y_{t-1} + 0.122\,\Delta y_{t-2} - 0.116\,\Delta y_{t-3} - 0.081\,\Delta y_{t-4} + \hat{\varepsilon}_t.$$

For this specification, the permanent effect of a one-unit change in $\varepsilon_t$ on the level
of real GNP is estimated to be

$$\psi(1) = 1/\phi(1) = 1/(1 - 0.312 - 0.122 + 0.116 + 0.081) = 1.31.$$
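The value 1.31 can be reproduced in two ways: directly from $1/\phi(1)$, and by accumulating the moving average weights $\psi_j$ implied by the autoregressive coefficients until they converge. The sketch below uses the coefficients reported above; the function names and the horizon of 200 periods are choices made for this example.

```python
def long_run_multiplier(phi):
    """psi(1) = 1 / phi(1) for an AR polynomial 1 - phi_1 z - ... - phi_p z^p."""
    return 1.0 / (1.0 - sum(phi))


def cumulative_multipliers(phi, horizon):
    """Effect of eps_t on the level y_{t+s}, for s = 0, 1, ..., horizon."""
    psi = [1.0]                                   # psi_0 = 1
    for j in range(1, horizon + 1):
        # psi_j = phi_1 * psi_{j-1} + ... + phi_p * psi_{j-p}
        psi.append(sum(phi[i] * psi[j - 1 - i]
                       for i in range(min(len(phi), j))))
    out, total = [], 0.0
    for value in psi:
        total += value
        out.append(total)
    return out


phi = [0.312, 0.122, -0.116, -0.081]              # estimates reported in the text
print(round(long_run_multiplier(phi), 2))                 # 1.31
print(round(cumulative_multipliers(phi, 200)[-1], 2))     # 1.31
```

The cumulative sums show how quickly the level effect approaches its permanent value; for these coefficients most of the adjustment occurs within the first few quarters.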


Transformations to Achieve Stationarity 


A final difference between trend-stationary and unit root processes that
deserves comment is the transformation of the data needed to generate a stationary
time series. If the process is really trend stationary as in [15.1.2], the appropriate
treatment is to subtract $\delta t$ from $y_t$ to produce a stationary representation of the
form of [15.1.1]. By contrast, if the data were really generated by the unit root
process [15.1.3], subtracting $\delta t$ from $y_t$ would succeed in removing the
time-dependence of the mean but not the variance. For example, if the data were generated
by [15.1.4], the random walk with drift, then

$$y_t - \delta t = y_0 + (\varepsilon_1 + \varepsilon_2 + \cdots + \varepsilon_t) \equiv y_0 + u_t.$$

The variance of the residual $u_t$ is $t\sigma^2$; it grows with the date of the observation.
Thus, subtracting a time trend from a unit root process is not sufficient to produce
a stationary time series.
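The growth of the variance of $u_t$ with $t$ can be confirmed exactly in a toy setting. The sketch below assumes, purely for illustration, that each innovation equals $+1$ or $-1$ with equal probability (so $\sigma^2 = 1$) and that $y_0 = 0$; it then enumerates every possible path of length $t$ rather than simulating. The function name is hypothetical.

```python
from itertools import product


def detrended_variance(t):
    """Exact variance of u_t = eps_1 + ... + eps_t for equiprobable +/-1 innovations."""
    sums = [sum(path) for path in product((-1, 1), repeat=t)]
    mean = sum(sums) / len(sums)
    return sum((value - mean) ** 2 for value in sums) / len(sums)


for t in range(1, 7):
    print(t, detrended_variance(t))   # the variance equals t in each case
```

Since all $2^t$ paths are enumerated, the printed variances are exact, and they equal $t\sigma^2$ with $\sigma^2 = 1$.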

The correct treatment for a unit root process is to difference the series, and
for this reason a process described by [15.1.3] is sometimes called a
difference-stationary process. Note, however, that if one were to try to difference a
trend-stationary process [15.1.2], the result would be

$$\Delta y_t = \delta + (1 - L)\psi(L)\varepsilon_t.$$

This is a stationary time series, but a unit root has been introduced into the moving
average representation. Thus, the result would be a noninvertible process subject
to the potential difficulties discussed in Chapters 3 through 5.


15.4. The Meaning of Tests for Unit Roots 


Knowing whether nonstationarity in the data is due to a deterministic time trend 
or a unit root would seem to be a very important question. For example, mac- 
roeconomists are very interested in knowing whether economic recessions have 
permanent consequences for the level of future GNP, or instead represent tem- 
porary downturns with the lost output eventually made up during the recovery. 
Nelson and Plosser (1982) argued that many economic series are better character- 
ized by unit roots than by deterministic time trends. A number of economists have 
tried to measure the size of the permanent consequences by estimating $\psi(1)$ for
various time series representations of GNP growth.⁴

⁴See, for example, Watson (1986), Clark (1987), Campbell and Mankiw (1987a, b), Cochrane (1988),
Gagnon (1988), Stock and Watson (1988), Durlauf (1989), and Hamilton (1989).

Although it might be very interesting to know whether a time series has a
unit root, several recent papers have argued that the question is inherently
unanswerable on the basis of a finite sample of observations.⁵ The argument takes
the form of two observations.

The first observation is that for any unit root process there exists a stationary 
process that will be impossible to distinguish from the unit root representation for 
any given sample size T. Such a stationary process is found easily enough by setting 
one of the eigenvalues close to but not quite equal to unity. For example, suppose 
the sample consists of T = 10,000 observations that were really generated by a 
driftless random walk: 


$$y_t = y_{t-1} + \varepsilon_t \qquad \text{true model (unit root).}$$  [15.4.1]

Consider trying to distinguish this from the following stationary process:

$$y_t = \phi y_{t-1} + \varepsilon_t, \quad |\phi| < 1 \qquad \text{false model (stationary).}$$  [15.4.2]

The s-period-ahead forecast of [15.4.1] is

$$\hat{y}_{t+s|t} = y_t$$  [15.4.3]

with MSE

$$E(y_{t+s} - \hat{y}_{t+s|t})^2 = s\sigma^2.$$  [15.4.4]

The corresponding forecast of [15.4.2] is

$$\hat{y}_{t+s|t} = \phi^s y_t$$  [15.4.5]

with MSE

$$E(y_{t+s} - \hat{y}_{t+s|t})^2 = (1 + \phi^2 + \phi^4 + \cdots + \phi^{2(s-1)})\cdot\sigma^2.$$  [15.4.6]


Clearly there exists a value of $\phi$ sufficiently close to unity such that the observable
implications of the stationary representation ([15.4.5] and [15.4.6]) are arbitrarily
close to those of the unit root process ([15.4.3] and [15.4.4]) in a sample of size
10,000.
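The closeness of the two sets of forecast implications can be checked numerically. The following Python sketch is illustrative only: the function names are hypothetical and do not come from the text. It evaluates the MSE expressions [15.4.4] and [15.4.6] and the gap between the point forecasts [15.4.3] and [15.4.5] for a given $\phi$.

```python
def mse_random_walk(s, sigma2=1.0):
    """MSE of the s-period-ahead forecast for the random walk, as in [15.4.4]."""
    return s * sigma2


def mse_ar1(s, phi, sigma2=1.0):
    """MSE of the s-period-ahead forecast for the AR(1), as in [15.4.6]."""
    return sigma2 * sum(phi ** (2 * k) for k in range(s))


def forecast_gap(s, phi, y_t):
    """Absolute difference between the forecasts [15.4.3] and [15.4.5]."""
    return abs(y_t - phi ** s * y_t)


if __name__ == "__main__":
    # As phi approaches one, both the MSE and the point forecast approach
    # the random walk values for a fixed horizon s.
    for phi in (0.9, 0.99, 0.9999):
        print(phi, mse_random_walk(10), mse_ar1(10, phi), forecast_gap(10, phi, 5.0))
```

For a fixed horizon the two models agree more and more closely as $\phi$ approaches unity, which is the sense in which the stationary model mimics the unit root model in a finite sample.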

More formally, the conditional likelihood function for a Gaussian process
characterized by [15.1.7] is continuous in the parameter $\lambda_1$. Hence, given any fixed
sample size T, any small numbers $\eta$ and $\varepsilon$, and any unit root specification with
$\lambda_1 = 1$, there exists a stationary specification with $\lambda_1 < 1$ with the property that
the probability is less than $\varepsilon$ that one would observe a sample of size T for which
the value of the likelihood implied by the unit root representation differs by more
than $\eta$ from the value of the likelihood implied by the stationary representation.

The converse proposition is also true: for any stationary process and a given
sample size T, there exists a unit root process that will be impossible to distinguish
from the stationary representation. Again, consider a simple example. Suppose the
true process is white noise:

$$y_t = \varepsilon_t \qquad \text{true model (stationary).}$$  [15.4.7]

Consider trying to distinguish this from

$$(1 - L)y_t = (1 + \theta L)\varepsilon_t, \quad |\theta| < 1 \qquad \text{false model (unit root)}$$  [15.4.8]
$$y_0 = \varepsilon_0 = 0.$$

The s-period-ahead forecast of [15.4.7] is

$$\hat{y}_{t+s|t} = 0$$

with MSE

$$E(y_{t+s} - \hat{y}_{t+s|t})^2 = \sigma^2.$$
⁵See Blough (1992a, b), Cochrane (1991), Christiano and Eichenbaum (1990), Stock (1990), and
Sims (1989). The sharpest statement of this view, and the perspective on which the remarks in the text
are based, is that of Blough.




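The forecast of [15.4.8] is obtained from [15.3.5]:

$$\begin{aligned}
\hat{y}_{t+s|t} &= y_t + \theta\varepsilon_t \\
&= \{\Delta y_t + \Delta y_{t-1} + \cdots + \Delta y_2 + y_1\} + \theta\varepsilon_t \\
&= \{(\varepsilon_t + \theta\varepsilon_{t-1}) + (\varepsilon_{t-1} + \theta\varepsilon_{t-2}) + \cdots + (\varepsilon_2 + \theta\varepsilon_1) + (\varepsilon_1)\} + \theta\varepsilon_t \\
&= (1 + \theta)\{\varepsilon_t + \varepsilon_{t-1} + \cdots + \varepsilon_1\}.
\end{aligned}$$

From [15.3.9], the MSE of the s-period-ahead forecast is

$$E(y_{t+s} - \hat{y}_{t+s|t})^2 = \{1 + (s - 1)(1 + \theta)^2\}\sigma^2.$$

Again, clearly, given any fixed sample size T, there exists a value of $\theta$ sufficiently
close to −1 that the unit root process [15.4.8] will have virtually the identical
observable implications to those of the stationary process [15.4.7].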
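The same point can be illustrated for the converse case. The sketch below is a minimal Python illustration with hypothetical function names. It builds a path of [15.4.8] from a list of innovations under the stated initial conditions $y_0 = \varepsilon_0 = 0$, confirms that the forecast $y_t + \theta\varepsilon_t$ collapses to $(1 + \theta)(\varepsilon_t + \cdots + \varepsilon_1)$, and evaluates the MSE expression for $\theta$ near −1.

```python
def simulate_unit_root_ma1(eps, theta):
    """Path y_1..y_t of (1 - L) y_t = (1 + theta L) eps_t with y_0 = eps_0 = 0."""
    y, level, prev_eps = [], 0.0, 0.0
    for e in eps:
        level += e + theta * prev_eps  # Delta y_t = eps_t + theta * eps_{t-1}
        y.append(level)
        prev_eps = e
    return y


def forecast_unit_root_ma1(eps, theta):
    """Closed form of the forecast: (1 + theta) * (eps_t + ... + eps_1)."""
    return (1.0 + theta) * sum(eps)


def mse_unit_root_ma1(s, theta, sigma2=1.0):
    """MSE of the s-period-ahead forecast: {1 + (s - 1)(1 + theta)^2} sigma^2."""
    return (1.0 + (s - 1) * (1.0 + theta) ** 2) * sigma2


if __name__ == "__main__":
    eps = [0.5, -1.0, 2.0, 0.25]  # arbitrary illustrative innovations
    theta = -0.9
    y = simulate_unit_root_ma1(eps, theta)
    print(y[-1] + theta * eps[-1], forecast_unit_root_ma1(eps, theta))
    print(mse_unit_root_ma1(10, -0.999))  # close to the white noise MSE of 1
```

With $\theta$ close to −1 the forecast is close to zero and the MSE is close to $\sigma^2$, the values implied by the white noise model [15.4.7].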

Unit root and stationary processes differ in their implications at infinite time
horizons, but for any given finite number of observations on the time series, there
is a representative from either class of models that could account for all the observed
features of the data. We therefore need to be careful with our choice of wording:
testing whether a particular time series “contains a unit root,” or testing whether
innovations “have a permanent effect on the level of the series,” however
interesting, is simply impossible to do.

Another way to express this is as follows. For a unit root process given by
[15.1.3], the autocovariance-generating function of $(1 - L)y_t$ is

$$g_{\Delta Y}(z) = \psi(z)\sigma^2\psi(z^{-1}).$$

The autocovariance-generating function evaluated at z = 1 is then

$$g_{\Delta Y}(1) = [\psi(1)]^2\sigma^2.$$  [15.4.9]

Recalling that the population spectrum of $\Delta y$ at frequency $\omega$ is defined by

$$s_{\Delta Y}(\omega) = \frac{1}{2\pi}\, g_{\Delta Y}(e^{-i\omega}),$$

expression [15.4.9] can alternatively be described as $2\pi$ times the spectrum at
frequency zero:

$$s_{\Delta Y}(0) = \frac{1}{2\pi}[\psi(1)]^2\sigma^2.$$

By contrast, if the true process is the trend-stationary specification [15.1.2],
the autocovariance-generating function of $\Delta y$ can be calculated from [3.6.15] as

$$g_{\Delta Y}(z) = (1 - z)\psi(z)\sigma^2\psi(z^{-1})(1 - z^{-1}),$$

which evaluated at z = 1 is zero. Thus, if the true process is trend-stationary, the
population spectrum of $\Delta y$ at frequency zero is zero, whereas if the true process
is characterized by a unit root, the population spectrum of $\Delta y$ at frequency zero
is positive.

The question of whether $y_t$ follows a unit root process can thus equivalently
be expressed as a question of whether the population spectrum of $\Delta y$ at frequency
zero is zero. However, there is no information in a sample of size T about cycles
with period greater than T, just as there is no information in a sample of size T
about the dynamic multiplier for a horizon s > T.
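The contrast at frequency zero can be made concrete with a short calculation. The following Python sketch is an illustration under stated assumptions: $\psi(L)$ is taken to be a finite polynomial with hypothetical coefficients, and the function names are not from the text. It evaluates the two autocovariance-generating functions on the unit circle and divides by $2\pi$ to obtain the spectrum of $\Delta y$.

```python
import cmath
import math


def poly(coeffs, x):
    """Evaluate psi(x) = psi_0 + psi_1 x + psi_2 x^2 + ... at x."""
    return sum(c * x ** j for j, c in enumerate(coeffs))


def agf_unit_root(psi, sigma2, z):
    """g(z) = psi(z) sigma^2 psi(1/z): Delta y under the unit root model."""
    return poly(psi, z) * sigma2 * poly(psi, 1 / z)


def agf_trend_stationary(psi, sigma2, z):
    """g(z) = (1 - z) psi(z) sigma^2 psi(1/z) (1 - 1/z): differenced trend-stationary model."""
    return (1 - z) * agf_unit_root(psi, sigma2, z) * (1 - 1 / z)


def spectrum(agf, psi, sigma2, omega):
    """Population spectrum: (1 / 2 pi) * g(exp(-i omega))."""
    z = cmath.exp(-1j * omega)
    return (agf(psi, sigma2, z) / (2 * math.pi)).real


if __name__ == "__main__":
    psi = [1.0, 0.5, 0.25]  # hypothetical MA coefficients, psi(1) = 1.75
    print(spectrum(agf_unit_root, psi, 2.0, 0.0))         # positive
    print(spectrum(agf_trend_stationary, psi, 2.0, 0.0))  # zero
```

Away from frequency zero both spectra are positive, so the two specifications differ only in what they imply about the very lowest frequencies.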

These observations notwithstanding, there are several closely related and very
interesting questions that are answerable. Given enough data, we certainly can ask
whether innovations have a significant effect on the level of the series over a
specified finite horizon. For a fixed time horizon (say, s = 3 years), there exists
a sample size (say, the half century of observations since World War II) such that
we can meaningfully inquire whether $\partial y_{t+s}/\partial\varepsilon_t$ is close to zero. We cannot tell
whether the data were really generated by [15.4.1] or a close relative of the form
of [15.4.2], but we can measure whether innovations have much persistence over
a fixed interval (as in [15.4.1] or [15.4.2]) or very little persistence over that interval
(as in [15.4.7] or [15.4.8]).
We can also arrive at a testable hypothesis if we are willing to restrict further
the class of processes considered. Suppose the dynamics of a given sample
$\{y_1, \ldots, y_T\}$ are to be modeled using an autoregression of fixed, known order p. For
example, suppose we are committed to using an AR(1) process:

$$y_t = \phi y_{t-1} + \varepsilon_t.$$  [15.4.10]

Within this class of models, the restriction

$$H_0\colon \phi = 1$$

is certainly testable. While it is true that there exist local alternatives (such as
$\phi = 0.99999$) against which a test would have essentially no power, this is true of
most hypothesis tests. There are also alternatives (such as $\phi = 0.3$) that would
lead to certain rejection of $H_0$ given enough observations. The hypothesis “$\{y_t\}$ is
an AR(1) process with a unit root” is potentially refutable; the hypothesis “$\{y_t\}$ is
a general unit root process of the form [15.1.3]” is not.

There may be good reasons to restrict ourselves to consider only low-order 
autoregressive representations. Parsimonious models often perform best, and au- 
toregressions are much easier to estimate and forecast than moving average proc- 
esses, particularly moving average processes with a root near unity. 

If we are indeed committed to describing the data with a low-order auto- 
regression, knowing whether the further restriction of a unit root should be imposed 
can clearly be important for two reasons. The first involves a familiar trade-off
between efficiency and consistency. If a restriction (in this case, a unit root) is 
true, more efficient estimates result from imposing it. Estimates of the other coef- 
ficients and dynamic multipliers will be more accurate, and forecasts will be better. 
If the restriction is false, the estimates are unreliable no matter how large the 
sample. Researchers differ in their advice on how to deal with this trade-off. One 
practical guide is to estimate the model both with and without the unit root imposed. 
If the key inferences are similar, so much the better. If the inferences differ, some 
attempt at explaining the conflicting findings (as in Christiano and Ljungqvist, 
1988, or Stock and Watson, 1989) may be desirable. 

In addition to the familiar trade-off between efficiency and consistency, the 
decision whether or not to impose unit roots on an autoregression also raises issues 
involving the asymptotic distribution theory one uses to test hypotheses about the 
process. This issue is explored in detail in later chapters. 


15.5. Other Approaches to Trended Time Series 


Although most of the analysis of nonstationarity in this book will be devoted to 
unit roots and time trends, this section briefly discusses two alternative approaches 
to modeling nonstationarity: fractionally integrated processes and processes with 
occasional, discrete shifts in the time trend.




Fractional Integration 


Recall that an integrated process of order d can be represented in the form

$$(1 - L)^d y_t = \psi(L)\varepsilon_t,$$  [15.5.1]

with $\sum_{j=0}^{\infty}|\psi_j| < \infty$. The normal assumption is that d = 1, or that the first difference
of the series is stationary. Occasionally one finds a series for which d = 2 might
be a better choice.

Granger and Joyeux (1980) and Hosking (1981) suggested that noninteger
values of d in [15.5.1] might also be useful. To understand the meaning of [15.5.1]
for noninteger d, consider the MA(∞) representation implied by [15.5.1]. It will
be shown shortly that the inverse of the operator $(1 - L)^d$ exists provided that
$d < \tfrac{1}{2}$. Multiplying both sides of [15.5.1] by $(1 - L)^{-d}$ results in

$$y_t = (1 - L)^{-d}\psi(L)\varepsilon_t.$$  [15.5.2]

For z a scalar, define the function

$$f(z) \equiv (1 - z)^{-d}.$$

This function has derivatives given by

$$\begin{aligned}
\frac{\partial f}{\partial z} &= d\cdot(1 - z)^{-d-1} \\
\frac{\partial^2 f}{\partial z^2} &= (d + 1)\cdot d\cdot(1 - z)^{-d-2} \\
\frac{\partial^3 f}{\partial z^3} &= (d + 2)\cdot(d + 1)\cdot d\cdot(1 - z)^{-d-3} \\
&\;\;\vdots \\
\frac{\partial^j f}{\partial z^j} &= (d + j - 1)\cdot(d + j - 2)\cdots(d + 1)\cdot d\cdot(1 - z)^{-d-j}.
\end{aligned}$$

A power series expansion for f(z) around z = 0 is thus given by

$$\begin{aligned}
(1 - z)^{-d} &= f(0) + \left.\frac{\partial f}{\partial z}\right|_{z=0}\cdot z
+ \frac{1}{2!}\left.\frac{\partial^2 f}{\partial z^2}\right|_{z=0}\cdot z^2
+ \frac{1}{3!}\left.\frac{\partial^3 f}{\partial z^3}\right|_{z=0}\cdot z^3 + \cdots \\
&= 1 + dz + (1/2!)(d + 1)dz^2 + (1/3!)(d + 2)(d + 1)dz^3 + \cdots.
\end{aligned}$$

This suggests that the operator $(1 - L)^{-d}$ might be represented by the filter

$$\begin{aligned}
(1 - L)^{-d} &= 1 + dL + (1/2!)(d + 1)dL^2 + (1/3!)(d + 2)(d + 1)dL^3 + \cdots \\
&= \sum_{j=0}^{\infty} h_j L^j,
\end{aligned}$$  [15.5.3]

where $h_0 \equiv 1$ and

$$h_j \equiv (1/j!)(d + j - 1)(d + j - 2)(d + j - 3)\cdots(d + 1)(d).$$  [15.5.4]

Appendix 15.A to this chapter establishes that if d < 1, $h_j$ can be approximated
for large j by

$$h_j \cong (j + 1)^{d-1}.$$  [15.5.5]




Thus, the time series model

$$y_t = (1 - L)^{-d}\varepsilon_t = h_0\varepsilon_t + h_1\varepsilon_{t-1} + h_2\varepsilon_{t-2} + \cdots$$  [15.5.6]

describes an MA(∞) representation in which the impulse-response coefficient $h_j$
behaves for large j like $(j + 1)^{d-1}$. For comparison, recall that the impulse-response
coefficient associated with the AR(1) process $y_t = (1 - \phi L)^{-1}\varepsilon_t$ is given by $\phi^j$.
The impulse-response coefficients for a stationary ARMA process decay geometrically,
in contrast to the slower decay implied by [15.5.5]. Because of this slower
rate of decay, Granger and Joyeux proposed the fractionally integrated process as
an approach to modeling long memories in a time series.
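The slow decay can be seen by computing the coefficients directly. The Python sketch below is a minimal illustration with hypothetical function names. It generates $h_j$ from [15.5.4] through the recursion $h_j = h_{j-1}(d + j - 1)/j$ and compares the result with $(j + 1)^{d-1}$ and with geometric decay. One caveat that is not stated in the text: for $0 < d < 1$ the bound $h_j \le (j + 1)^{d-1}$ holds term by term, and the ratio of the two settles near the constant $1/\Gamma(d)$ (a standard gamma-function fact), so [15.5.5] is best read as describing the rate of decay rather than the exact level.

```python
import math


def frac_ma_coeffs(d, n):
    """Coefficients h_0, ..., h_n of (1 - L)^(-d), as defined in [15.5.4]."""
    h = [1.0]
    for j in range(1, n + 1):
        h.append(h[-1] * (d + j - 1) / j)  # h_j = h_{j-1} * (d + j - 1) / j
    return h


def power_law(d, j):
    """The large-j approximation (j + 1)^(d - 1) from [15.5.5]."""
    return (j + 1) ** (d - 1)


if __name__ == "__main__":
    d = 0.3  # hypothetical value, chosen with 0 < d < 1/2
    h = frac_ma_coeffs(d, 2000)
    for j in (10, 100, 1000, 2000):
        # Columns: j, h_j, (j + 1)^(d - 1), their ratio, and geometric decay 0.9^j
        print(j, h[j], power_law(d, j), h[j] / power_law(d, j), 0.9 ** j)
    print(1 / math.gamma(d))  # the constant the ratio approaches
```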

In a finite sample, this long memory could be approximated arbitrarily well 
with a suitably large-order ARMA representation. The goal of the fractional-dif- 
ference specification is to capture parsimoniously long-run multipliers that decay 
very slowly. 

The sequence of limiting moving average coefficients $\{h_j\}_{j=0}^{\infty}$ given in [15.5.4]
can be shown to be square-summable provided that $d < \tfrac{1}{2}$:⁶

$$\sum_{j=0}^{\infty} h_j^2 < \infty \qquad \text{for } d < \tfrac{1}{2}.$$

Thus, [15.5.6] defines a covariance-stationary process provided that $d < \tfrac{1}{2}$. If
$d > \tfrac{1}{2}$, the proposal is to difference the process before describing it by [15.5.2]. For
example, if d = 0.7, the process of [15.5.1] implies

$$(1 - L)^{-0.3}(1 - L)y_t = \psi(L)\varepsilon_t;$$

that is, $\Delta y_t$ is fractionally integrated with parameter $d = -0.3 < \tfrac{1}{2}$.

Conditions under which fractional integration could arise from aggregation 
of other processes were described by Granger (1980). Geweke and Porter-Hudak 
(1983) and Sowell (1992) proposed techniques for estimating d. Diebold and Ru- 
debusch (1989) analyzed GNP data and the persistence of business cycle fluctuations 
using this approach, while Lo (1991) provided an interesting investigation of the 
persistence of movements in stock prices. 


⁶Reasoning as in Appendix 3.A to Chapter 3,

$$\begin{aligned}
\sum_{j=0}^{N-1}(j + 1)^{2(d-1)} = \sum_{j=1}^{N} j^{2(d-1)}
&< 1 + \int_1^N x^{2(d-1)}\,dx \\
&= 1 + [1/(2d - 1)]\,x^{2d-1}\big|_{x=1}^{N} \\
&= 1 + [1/(2d - 1)]\cdot[N^{2d-1} - 1],
\end{aligned}$$

which converges to $1 - [1/(2d - 1)]$ as $N \to \infty$, provided that $d < \tfrac{1}{2}$.


Occasional Breaks in Trend


According to the unit root specification [15.1.3], events are occurring all the
time that permanently affect the course of y. Perron (1989) and Rappoport and
Reichlin (1989) have argued that economic events that have large, permanent effects
are relatively rare. The idea can be illustrated with the following model, in which
$y_t$ is stationary around a trend with a single break:


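$$y_t = \begin{cases}
\alpha_1 + \delta t + \varepsilon_t & \text{for } t < T_0 \\
\alpha_2 + \delta t + \varepsilon_t & \text{for } t \ge T_0.
\end{cases}$$  [15.5.7]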


The finding is that such series would appear to exhibit unit root nonstationarity 
on the basis of the tests to be discussed in Chapter 17. 
Another way of thinking about the process in [15.5.7] is as follows:

$$\Delta y_t = \xi_t + \delta + \varepsilon_t - \varepsilon_{t-1},$$  [15.5.8]

where $\xi_t = (\alpha_2 - \alpha_1)$ when $t = T_0$ and is zero otherwise. Suppose $\xi_t$ is viewed as
a random variable with some probability distribution, say,

$$\xi_t = \begin{cases}
\alpha_2 - \alpha_1 & \text{with probability } p \\
0 & \text{with probability } 1 - p.
\end{cases}$$

Evidently, p must be quite small to represent the idea that this is a relatively rare
event. Equation [15.5.8] could then be rewritten as

$$\Delta y_t = \mu + \eta_t,$$  [15.5.9]

where

$$\begin{aligned}
\mu &= p(\alpha_2 - \alpha_1) + \delta \\
\eta_t &= \xi_t - p(\alpha_2 - \alpha_1) + \varepsilon_t - \varepsilon_{t-1}.
\end{aligned}$$

But $\eta_t$ is the sum of a zero-mean white noise process $[\xi_t - p(\alpha_2 - \alpha_1)]$ and an
independent MA(1) process $[\varepsilon_t - \varepsilon_{t-1}]$. Therefore, an MA(1) representation for
$\eta_t$ exists. From this perspective, [15.5.9] could be viewed as an ARIMA(0, 1, 1)
process,

$$\Delta y_t = \mu + v_t + \theta v_{t-1},$$

with a non-Gaussian distribution for the innovations $v_t$:

$$v_t = y_t - \hat{E}(y_t \mid y_{t-1}, y_{t-2}, \ldots).$$

The optimal linear forecasting rule,

$$\hat{E}(y_{t+s} \mid y_t, y_{t-1}, \ldots) = \mu s + y_t + \theta v_t,$$

puts a nonvanishing weight on each date's innovation. This weight does not disappear
as $s \to \infty$, because each period essentially provides a new observation on
the variable $\xi_t$ and the realization of $\xi_t$ has permanent consequences for the level
of the series. From this perspective, a time series satisfying [15.5.7] could be
described as a unit root process with non-Gaussian innovations.
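The MA(1) representation for $\eta_t$ can be computed by matching autocovariances. The Python sketch below is illustrative and rests on labeled assumptions: the function name and the parameter values are hypothetical, and $\xi_t$ is taken to be independent across dates and independent of $\varepsilon_t$, as in the discussion above. Under these assumptions $\eta_t$ has variance $p(1 - p)(\alpha_2 - \alpha_1)^2 + 2\sigma^2$ and first autocovariance $-\sigma^2$, and the invertible $\theta$ solves $\theta/(1 + \theta^2) = \rho_1$.

```python
import math


def break_ma1_params(p, jump, sigma2):
    """Invertible MA(1) parameters (theta, var of v_t) matching eta_t's autocovariances.

    p      : probability of a break in any period (assumed)
    jump   : size of the break, alpha_2 - alpha_1 (assumed)
    sigma2 : variance of epsilon_t
    """
    var_xi = p * (1.0 - p) * jump ** 2  # variance of the Bernoulli-type xi_t
    gamma0 = var_xi + 2.0 * sigma2      # Var(eta_t)
    gamma1 = -sigma2                    # Cov(eta_t, eta_{t-1})
    rho1 = gamma1 / gamma0
    # Root of rho1 * theta^2 - theta + rho1 = 0 with |theta| <= 1.
    disc = max(0.0, 1.0 - 4.0 * rho1 ** 2)
    theta = (1.0 - math.sqrt(disc)) / (2.0 * rho1)
    sigma2_v = gamma1 / theta           # from theta * sigma2_v = gamma1
    return theta, sigma2_v


if __name__ == "__main__":
    print(break_ma1_params(0.05, 4.0, 1.0))  # theta strictly inside (-1, 0)
    print(break_ma1_params(0.0, 4.0, 1.0))   # no breaks: theta = -1
```

When p = 0 the calculation returns $\theta = -1$, the noninvertible case that results from differencing a trend-stationary process; any positive probability of a break moves $\theta$ inside the unit circle.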

Lam (1990) estimated a model closely related to [15.5.7] where shifts in the 
slope of the trend line were assumed to follow a Markov chain and where U.S. 
real GNP was allowed to follow a stationary third-order autoregression around this 
trend. Results of his maximum likelihood estimation are reported in Figure 15.4. 
These findings are very interesting for the question of the long-run consequences 
of economic recessions. According to this specification, events that permanently 
changed the level of GNP coincided with the recessions of 1957, 1973, and 1980. 


FIGURE 15.4 Discrete trend shifts estimated for U.S. real GNP, 1952-84 (Lam,
1990).


APPENDIX 15.A. Derivation of Selected Equations for Chapter 15 


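Derivation of Equation [15.5.5]. Write [15.5.4] as

$$\begin{aligned}
h_j &= (1/j!)(d + j - 1)(d + j - 2)(d + j - 3)\cdots(d + 1)(d) \\
&= \left[\frac{d + j - 1}{j}\right]\left[\frac{d + j - 2}{j - 1}\right]\left[\frac{d + j - 3}{j - 2}\right]\cdots\left[\frac{d + 1}{2}\right]\left[\frac{d}{1}\right] \\
&= \left[\frac{j + d - 1}{j}\right]\left[\frac{j - 1 + d - 1}{j - 1}\right]\left[\frac{j - 2 + d - 1}{j - 2}\right]\times\cdots \\
&\qquad \times\left[\frac{j - (j - 2) + d - 1}{j - (j - 2)}\right]\left[\frac{j - (j - 1) + d - 1}{j - (j - 1)}\right] \\
&= \left[1 + \frac{d - 1}{j}\right]\left[1 + \frac{d - 1}{j - 1}\right]\left[1 + \frac{d - 1}{j - 2}\right]\times\cdots\times\left[1 + \frac{d - 1}{2}\right]\left[1 + \frac{d - 1}{1}\right].
\end{aligned}$$  [15.A.1]

For large j, we have the approximation

$$\left[1 + \frac{d - 1}{j}\right] \cong \left[1 + \frac{1}{j}\right]^{d-1}.$$  [15.A.2]

To justify this formally, consider the function $g(x) = (1 + x)^{d-1}$. Taylor's theorem states
that

$$\begin{aligned}
(1 + x)^{d-1} &= g(0) + \left.\frac{\partial g}{\partial x}\right|_{x=0}\cdot x + \frac{1}{2}\left.\frac{\partial^2 g}{\partial x^2}\right|_{x=\delta}\cdot x^2 \\
&= 1 + (d - 1)x + \frac{1}{2}(d - 1)(d - 2)(1 + \delta)^{d-3}x^2
\end{aligned}$$  [15.A.3]

for some $\delta$ between zero and x. For x > −1 and d < 1, equation [15.A.3] implies that

$$(1 + x)^{d-1} \ge 1 + (d - 1)x.$$

Letting x = 1/j gives

$$1 + \frac{d - 1}{j} \le \left[1 + \frac{1}{j}\right]^{d-1} = \left[\frac{j + 1}{j}\right]^{d-1}$$  [15.A.4]

for all j > 0 and d < 1, with the approximation [15.A.2] improving as $j \to \infty$. Substituting
[15.A.4] into [15.A.1] implies that

$$h_j \cong \left[\frac{j + 1}{j}\right]^{d-1}\left[\frac{j}{j - 1}\right]^{d-1}\left[\frac{j - 1}{j - 2}\right]^{d-1}\times\cdots\times\left[\frac{3}{2}\right]^{d-1}\left[\frac{2}{1}\right]^{d-1} = (j + 1)^{d-1}.$$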

Chapter 15 References 


Blough, Stephen R. 1992a. “The Relationship between Power and Level for Generic Unit
Root Tests in Finite Samples.” Journal of Applied Econometrics 7:295-308.

Blough, Stephen R. 1992b. “Near Observational Equivalence of Unit Root and Stationary
Processes: Theory and Implications.” Johns Hopkins University. Mimeo.

Box, G. E. P., and Gwilym M. Jenkins. 1976. Time Series Analysis: Forecasting and Control,
rev. ed. San Francisco: Holden-Day.

Campbell, John Y., and N. Gregory Mankiw. 1987a. “Permanent and Transitory Components
in Macroeconomic Fluctuations.” American Economic Review Papers and Proceedings
77:111-17.

Campbell, John Y., and N. Gregory Mankiw. 1987b. “Are Output Fluctuations Transitory?”
Quarterly Journal of Economics 102:857-80.

Christiano, Lawrence J., and Martin Eichenbaum. 1990. “Unit Roots in Real GNP: Do We
Know and Do We Care?” in Allan H. Meltzer, ed., Unit Roots, Investment Measures, and
Other Essays, 7-61. Carnegie-Rochester Conference Series on Public Policy, Vol. 32.
Amsterdam: North-Holland.

Christiano, Lawrence J., and Lars Ljungqvist. 1988. “Money Does Granger-Cause Output
in the Bivariate Money-Output Relation.” Journal of Monetary Economics 22:217-35.

Clark, Peter K. 1987. “The Cyclical Component of U.S. Economic Activity.” Quarterly
Journal of Economics 102:797-814.

Cochrane, John H. 1988. “How Big Is the Random Walk in GNP?” Journal of Political
Economy 96:893-920.

Cochrane, John H. 1991. “A Critique of the Application of Unit Root Tests.” Journal of
Economic Dynamics and Control 15:275-84.

Diebold, Francis X., and Glenn D. Rudebusch. 1989. “Long Memory and Persistence in
Aggregate Output.” Journal of Monetary Economics 24:189-209.

Durlauf, Steven N. 1989. “Output Persistence, Economic Structure, and Choice of
Stabilization Policy.” Brookings Papers on Economic Activity 2:1989, 69-116.

Friedman, Milton. 1957. A Theory of the Consumption Function. Princeton, N.J.: Princeton
University Press.

Gagnon, Joseph E. 1988. “Short-Run Models and Long-Run Forecasts: A Note on the
Permanence of Output Fluctuations.” Quarterly Journal of Economics 103:415-24.

Geweke, John, and Susan Porter-Hudak. 1983. “The Estimation and Application of Long
Memory Time Series Models.” Journal of Time Series Analysis 4:221-38.

Granger, C. W. J. 1980. “Long Memory Relationships and the Aggregation of Dynamic
Models.” Journal of Econometrics 14:227-38.

Granger, C. W. J., and Roselyne Joyeux. 1980. “An Introduction to Long-Memory Time
Series Models and Fractional Differencing.” Journal of Time Series Analysis 1:15-29.

Hamilton, James D. 1989. “A New Approach to the Economic Analysis of Nonstationary
Time Series and the Business Cycle.” Econometrica 57:357-84.




Hosking, J. R. M. 1981. “Fractional Differencing.” Biometrika 68:165-76. 

Lam, Pok-sang. 1990. “The Hamilton Model with a General Autoregressive Component:
Estimation and Comparison with Other Models of Economic Time Series.” Journal of
Monetary Economics 26:409-32.

Lo, Andrew W. 1991. “Long-Term Memory in Stock Market Prices.” Econometrica
59:1279-1313.

Muth, John F. 1960. “Optimal Properties of Exponentially Weighted Forecasts.” Journal
of the American Statistical Association 55:299-306.

Nelson, Charles R., and Charles I. Plosser. 1982. “Trends and Random Walks in
Macroeconomic Time Series: Some Evidence and Implications.” Journal of Monetary Economics
10:139-62.

Perron, Pierre. 1989. “The Great Crash, the Oil Price Shock, and the Unit Root Hypothesis.”
Econometrica 57:1361-1401.

Rappoport, Peter, and Lucrezia Reichlin. 1989. “Segmented Trends and Nonstationary Time
Series.” Economic Journal supplement 99:168-77.

Sims, Christopher A. 1989. “Modeling Trends.” Yale University. Mimeo.

Sowell, Fallaw. 1992. “Maximum Likelihood Estimation of Stationary Univariate
Fractionally Integrated Time Series Models.” Journal of Econometrics 53:165-88.

Stock, James H. 1990. “‘Unit Roots in Real GNP: Do We Know and Do We Care?’ A
Comment,” in Allan H. Meltzer, ed., Unit Roots, Investment Measures, and Other Essays,
63-82. Carnegie-Rochester Conference Series on Public Policy, Vol. 32. Amsterdam:
North-Holland.

Stock, James H., and Mark W. Watson. 1988. “Variable Trends in Economic Time Series.”
Journal of Economic Perspectives vol. 2, no. 3, 147-74.

Stock, James H., and Mark W. Watson. 1989. “Interpreting the Evidence on Money-Income
Causality.” Journal of Econometrics 40:161-81.

Watson, Mark W. 1986. “Univariate Detrending Methods with Stochastic Trends.” Journal
of Monetary Economics 18:49-75.




16 Processes with Deterministic Time Trends


The coefficients of regression models involving unit roots or deterministic time 
trends are typically estimated by ordinary least squares. However, the asymptotic 
distributions of the coefficient estimates cannot be calculated in the same way as 
are those for regression models involving stationary variables. Among other dif- 
ficulties, the estimates of different parameters will in general have different asymp- 
totic rates of convergence. This chapter introduces the idea of different rates of 
convergence and develops a general approach to obtaining asymptotic distributions 
suggested by Sims, Stock, and Watson (1990).¹ This chapter deals exclusively with
processes involving deterministic time trends but no unit roots. One of the results
for such processes will be that the usual OLS t and F statistics, calculated in the 
usual way, have the same asymptotic distributions as they do for stationary regres- 
sions. Although the limiting distributions are standard, the techniques used to verify 
these limiting distributions are different from those used in Chapter 8. These 
techniques will also be used to develop the asymptotic distributions for processes 
including unit roots in Chapters 17 and 18. 

This chapter begins with the simplest example of i.i.d. innovations around a 
deterministic time trend. Section 16.1 derives the asymptotic distributions of the 
coefficient estimates for this model and illustrates a rescaling of variables that is 
necessary to accommodate different asymptotic rates of convergence. Section 16.2 
shows that despite the different asymptotic rates of convergence, the standard OLS 
t and F statistics have the usual limiting distributions for this model. Section 16.3 
develops analogous results for a covariance-stationary autoregression around a 
deterministic time trend. That section also introduces the Sims, Stock, and Watson 
technique of transforming the regression model into a canonical form for which 
the asymptotic distribution is simpler to describe. 


16.1. Asymptotic Distribution of OLS Estimates of the Simple Time Trend Model

This section considers OLS estimation of the parameters of a simple time trend,

$$y_t = \alpha + \delta t + \varepsilon_t,$$  [16.1.1]

for $\varepsilon_t$ a white noise process. If $\varepsilon_t \sim N(0, \sigma^2)$, then the model [16.1.1] satisfies the
classical regression assumptions² and the standard OLS t or F statistics in equations
[8.1.26] and [8.1.32] would have exact small-sample t or F distributions. On the
other hand, if $\varepsilon_t$ is non-Gaussian, then a slightly different technique for finding the
asymptotic distributions of the OLS estimates of $\alpha$ and $\delta$ would have to be used
from that employed for stationary regressions in Chapter 8. This chapter introduces
this technique, which will prove useful not only for studying time trends but also
for analyzing estimators for a variety of nonstationary processes in Chapters 17
and 18.³

¹A simpler version of this theme appeared in the analysis of a univariate process with unit roots by
Fuller (1976).
²See Assumption 8.1 in Chapter 8.

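³The general approach in these chapters follows Sims, Stock, and Watson (1990).

Recall the approach used to find asymptotic distributions for regressions with
stationary explanatory variables in Chapter 8. Write [16.1.1] in the form of the
standard regression model,

$$y_t = \mathbf{x}_t'\boldsymbol{\beta} + \varepsilon_t,$$  [16.1.2]

where

$$\underset{(1\times 2)}{\mathbf{x}_t'} \equiv \begin{bmatrix} 1 & t \end{bmatrix}$$  [16.1.3]

$$\underset{(2\times 1)}{\boldsymbol{\beta}} \equiv \begin{bmatrix} \alpha \\ \delta \end{bmatrix}.$$  [16.1.4]

Let $\mathbf{b}_T$ denote the OLS estimate of $\boldsymbol{\beta}$ based on a sample of size T:

$$\mathbf{b}_T \equiv \begin{bmatrix} \hat{\alpha}_T \\ \hat{\delta}_T \end{bmatrix}
= \left[\sum_{t=1}^{T}\mathbf{x}_t\mathbf{x}_t'\right]^{-1}\left[\sum_{t=1}^{T}\mathbf{x}_t y_t\right].$$  [16.1.5]

Recall from equation [8.2.3] that the deviation of the OLS estimate from the true
value can be expressed as

$$(\mathbf{b}_T - \boldsymbol{\beta}) = \left[\sum_{t=1}^{T}\mathbf{x}_t\mathbf{x}_t'\right]^{-1}\left[\sum_{t=1}^{T}\mathbf{x}_t\varepsilon_t\right].$$  [16.1.6]

To find the limiting distribution for a regression with stationary explanatory
variables, the approach in Chapter 8 was to multiply [16.1.6] by $\sqrt{T}$, resulting in

$$\sqrt{T}(\mathbf{b}_T - \boldsymbol{\beta}) = \left[(1/T)\sum_{t=1}^{T}\mathbf{x}_t\mathbf{x}_t'\right]^{-1}\left[(1/\sqrt{T})\sum_{t=1}^{T}\mathbf{x}_t\varepsilon_t\right].$$  [16.1.7]

The usual assumption was that $(1/T)\sum_{t=1}^{T}\mathbf{x}_t\mathbf{x}_t'$ converged in probability to a
nonsingular matrix $\mathbf{Q}$ while $(1/\sqrt{T})\sum_{t=1}^{T}\mathbf{x}_t\varepsilon_t$ converged in distribution to a $N(\mathbf{0}, \sigma^2\mathbf{Q})$
random variable, implying that $\sqrt{T}(\mathbf{b}_T - \boldsymbol{\beta}) \xrightarrow{L} N(\mathbf{0}, \sigma^2\mathbf{Q}^{-1})$.

To see why this same argument cannot be used for a deterministic time trend,
note that for $\mathbf{x}_t$ and $\boldsymbol{\beta}$ given in equations [16.1.3] and [16.1.4], expression [16.1.6]
would be

$$\begin{bmatrix} \hat{\alpha}_T - \alpha \\ \hat{\delta}_T - \delta \end{bmatrix}
= \begin{bmatrix} \Sigma 1 & \Sigma t \\ \Sigma t & \Sigma t^2 \end{bmatrix}^{-1}
\begin{bmatrix} \Sigma\varepsilon_t \\ \Sigma t\varepsilon_t \end{bmatrix},$$  [16.1.8]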


where $\Sigma$ denotes summation for t = 1 through T. It is straightforward to show by
induction that⁴

$$\sum_{t=1}^{T} t = T(T + 1)/2$$  [16.1.9]

$$\sum_{t=1}^{T} t^2 = T(T + 1)(2T + 1)/6.$$  [16.1.10]

Thus, the leading term in $\sum_{t=1}^{T} t$ is $T^2/2$; that is,

$$(1/T^2)\sum_{t=1}^{T} t = (1/T^2)[(T^2/2) + (T/2)] = 1/2 + 1/(2T) \to 1/2.$$  [16.1.11]

Similarly, the leading term in $\sum_{t=1}^{T} t^2$ is $T^3/3$:

$$\begin{aligned}
(1/T^3)\sum_{t=1}^{T} t^2 &= (1/T^3)[(2T^3/6) + (3T^2/6) + T/6] \\
&= 1/3 + 1/(2T) + 1/(6T^2) \\
&\to 1/3.
\end{aligned}$$  [16.1.12]

For future reference, we note here the general pattern: the leading term in
$\sum_{t=1}^{T} t^{\nu}$ is $T^{\nu+1}/(\nu + 1)$:

$$(1/T^{\nu+1})\sum_{t=1}^{T} t^{\nu} \to 1/(\nu + 1).$$  [16.1.13]

To verify [16.1.13], note that

$$(1/T^{\nu+1})\sum_{t=1}^{T} t^{\nu} = (1/T)\sum_{t=1}^{T}(t/T)^{\nu}.$$  [16.1.14]


The right side of [16.1.14] can be viewed as an approximation to the area under
the curve

$$f(r) = r^{\nu}$$

for r between zero and unity. To see this, notice that $(1/T)\cdot(t/T)^{\nu}$ represents
the area of a rectangle with width (1/T) and height $r^{\nu}$ evaluated at r = t/T (see
Figure 16.1). Thus, [16.1.14] is the sum of the area of these rectangles evaluated
at r = 1/T, 2/T, . . . , 1. As $T \to \infty$, this sum converges to the area under the
curve f(r):

$$(1/T)\sum_{t=1}^{T}(t/T)^{\nu} \to \int_0^1 r^{\nu}\,dr = r^{\nu+1}/(\nu + 1)\big|_{r=0}^{1} = 1/(\nu + 1).$$  [16.1.15]

FIGURE 16.1 Demonstration that $(1/T)\sum_{t=1}^{T}(t/T)^{\nu} \to \int_0^1 r^{\nu}\,dr = 1/(\nu + 1)$.

⁴Clearly, [16.1.9] and [16.1.10] hold for T = 1. Given that [16.1.9] holds for T,

$$\sum_{t=1}^{T+1} t = \sum_{t=1}^{T} t + (T + 1) = T(T + 1)/2 + (T + 1) = (T + 1)[(T/2) + 1] = (T + 1)(T + 2)/2,$$

establishing that [16.1.9] holds for T + 1. Similarly, given that [16.1.10] holds for T,

$$\begin{aligned}
\sum_{t=1}^{T+1} t^2 &= T(T + 1)(2T + 1)/6 + (T + 1)^2 \\
&= (T + 1)\{[T(2T + 1)/6] + (T + 1)\} \\
&= (T + 1)(2T^2 + 7T + 6)/6 \\
&= (T + 1)(T + 2)[2(T + 1) + 1]/6,
\end{aligned}$$

establishing that [16.1.10] holds for T + 1.
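The limit in [16.1.13] through [16.1.15] is easy to check numerically. The following Python sketch is illustrative only, with a hypothetical function name. It evaluates the right side of [16.1.14] for a given T and power $\nu$, so that the result can be compared with $1/(\nu + 1)$ and with the exact finite-sample expressions [16.1.11] and [16.1.12].

```python
def scaled_power_sum(T, nu):
    """Right side of [16.1.14]: (1/T) * sum over t = 1..T of (t/T)^nu."""
    return sum((t / T) ** nu for t in range(1, T + 1)) / T


if __name__ == "__main__":
    for nu in (1, 2, 4):
        for T in (10, 100, 10000):
            # The gap from 1 / (nu + 1) shrinks roughly like 1 / (2T).
            print(nu, T, scaled_power_sum(T, nu), 1.0 / (nu + 1))
```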


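For $\mathbf{x}_t$ given in [16.1.3], results [16.1.9] and [16.1.10] imply that

$$\sum_{t=1}^{T}\mathbf{x}_t\mathbf{x}_t' = \begin{bmatrix} \Sigma 1 & \Sigma t \\ \Sigma t & \Sigma t^2 \end{bmatrix}
= \begin{bmatrix} T & T(T + 1)/2 \\ T(T + 1)/2 & T(T + 1)(2T + 1)/6 \end{bmatrix}.$$  [16.1.16]

In contrast to the usual result for stationary regressions, for the matrix in [16.1.16],
$(1/T)\sum_{t=1}^{T}\mathbf{x}_t\mathbf{x}_t'$ diverges. To obtain a convergent matrix, [16.1.16] would have to
be divided by $T^3$ rather than T:

$$T^{-3}\sum_{t=1}^{T}\mathbf{x}_t\mathbf{x}_t' \to \begin{bmatrix} 0 & 0 \\ 0 & \tfrac{1}{3} \end{bmatrix}.$$

Unfortunately, this limiting matrix cannot be inverted, as $(1/T)\sum_{t=1}^{T}\mathbf{x}_t\mathbf{x}_t'$ can be
in the usual case. Hence, a different approach from that in the stationary case will
be needed to calculate the asymptotic distribution of $\mathbf{b}_T$.

It turns out that the OLS estimates $\hat{\alpha}_T$ and $\hat{\delta}_T$ have different asymptotic rates
of convergence. To arrive at nondegenerate limiting distributions, $\hat{\alpha}_T$ is multiplied
by $\sqrt{T}$, whereas $\hat{\delta}_T$ must be multiplied by $T^{3/2}$! We can think of this adjustment
as premultiplying [16.1.6] or [16.1.8] by the matrix

$$\mathbf{Y}_T \equiv \begin{bmatrix} \sqrt{T} & 0 \\ 0 & T^{3/2} \end{bmatrix},$$  [16.1.17]

resulting in

$$\begin{aligned}
\begin{bmatrix} \sqrt{T}(\hat{\alpha}_T - \alpha) \\ T^{3/2}(\hat{\delta}_T - \delta) \end{bmatrix}
&= \mathbf{Y}_T\left[\sum_{t=1}^{T}\mathbf{x}_t\mathbf{x}_t'\right]^{-1}\left[\sum_{t=1}^{T}\mathbf{x}_t\varepsilon_t\right] \\
&= \mathbf{Y}_T\left[\sum_{t=1}^{T}\mathbf{x}_t\mathbf{x}_t'\right]^{-1}\mathbf{Y}_T\mathbf{Y}_T^{-1}\left[\sum_{t=1}^{T}\mathbf{x}_t\varepsilon_t\right] \\
&= \left\{\mathbf{Y}_T^{-1}\left[\sum_{t=1}^{T}\mathbf{x}_t\mathbf{x}_t'\right]\mathbf{Y}_T^{-1}\right\}^{-1}\left\{\mathbf{Y}_T^{-1}\left[\sum_{t=1}^{T}\mathbf{x}_t\varepsilon_t\right]\right\}.
\end{aligned}$$  [16.1.18]

Consider the first term in the last expression of [16.1.18]. Substituting from
[16.1.17] and [16.1.16],

$$\begin{aligned}
\left\{\mathbf{Y}_T^{-1}\left[\sum_{t=1}^{T}\mathbf{x}_t\mathbf{x}_t'\right]\mathbf{Y}_T^{-1}\right\}
&= \begin{bmatrix} T^{-1/2} & 0 \\ 0 & T^{-3/2} \end{bmatrix}
\begin{bmatrix} \Sigma 1 & \Sigma t \\ \Sigma t & \Sigma t^2 \end{bmatrix}
\begin{bmatrix} T^{-1/2} & 0 \\ 0 & T^{-3/2} \end{bmatrix} \\
&= \begin{bmatrix} T^{-1}\Sigma 1 & T^{-2}\Sigma t \\ T^{-2}\Sigma t & T^{-3}\Sigma t^2 \end{bmatrix}.
\end{aligned}$$

Thus, it follows from [16.1.11] and [16.1.12] that

$$\left\{\mathbf{Y}_T^{-1}\left[\sum_{t=1}^{T}\mathbf{x}_t\mathbf{x}_t'\right]\mathbf{Y}_T^{-1}\right\} \to \mathbf{Q},$$  [16.1.19]

where

$$\mathbf{Q} \equiv \begin{bmatrix} 1 & \tfrac{1}{2} \\ \tfrac{1}{2} & \tfrac{1}{3} \end{bmatrix}.$$  [16.1.20]

Turning next to the second term in [16.1.18],

$$\mathbf{Y}_T^{-1}\left[\sum_{t=1}^{T}\mathbf{x}_t\varepsilon_t\right]
= \begin{bmatrix} T^{-1/2} & 0 \\ 0 & T^{-3/2} \end{bmatrix}
\begin{bmatrix} \Sigma\varepsilon_t \\ \Sigma t\varepsilon_t \end{bmatrix}
= \begin{bmatrix} (1/\sqrt{T})\Sigma\varepsilon_t \\ (1/\sqrt{T})\Sigma(t/T)\varepsilon_t \end{bmatrix}.$$  [16.1.21]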

Under standard assumptions about $\varepsilon_t$, this vector will be asymptotically Gaussian.
For example, suppose that $\varepsilon_t$ is i.i.d. with mean zero, variance $\sigma^2$, and finite fourth
moment. Then the first element of the vector in [16.1.21] satisfies

$$(1/\sqrt{T})\sum_{t=1}^{T}\varepsilon_t \xrightarrow{L} N(0, \sigma^2),$$

by the central limit theorem.

For the second element of the vector in [16.1.21], observe that $\{(t/T)\varepsilon_t\}$ is a
martingale difference sequence that satisfies the conditions of Proposition 7.8.
Specifically, its variance is

$$\sigma_t^2 = E[(t/T)\varepsilon_t]^2 = \sigma^2\cdot(t^2/T^2),$$

where

$$(1/T)\sum_{t=1}^{T}\sigma_t^2 = \sigma^2(1/T^3)\sum_{t=1}^{T}t^2 \to \sigma^2/3.$$

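Furthermore, $(1/T)\sum_{t=1}^{T}[(t/T)\varepsilon_t]^2 \xrightarrow{p} \sigma^2/3$. To verify this last claim, notice that

$$\begin{aligned}
E\left((1/T)\sum_{t=1}^{T}[(t/T)\varepsilon_t]^2 - (1/T)\sum_{t=1}^{T}\sigma_t^2\right)^2
&= E\left((1/T)\sum_{t=1}^{T}[(t/T)\varepsilon_t]^2 - (1/T)\sum_{t=1}^{T}(t/T)^2\sigma^2\right)^2 \\
&= E\left((1/T)\sum_{t=1}^{T}(t/T)^2(\varepsilon_t^2 - \sigma^2)\right)^2 \\
&= (1/T)^2\sum_{t=1}^{T}(t/T)^4E(\varepsilon_t^2 - \sigma^2)^2.
\end{aligned}$$  [16.1.22]

But from [16.1.13], T times the magnitude in [16.1.22] converges to

$$(1/T)\sum_{t=1}^{T}(t/T)^4E(\varepsilon_t^2 - \sigma^2)^2 \to (1/5)\cdot E(\varepsilon_t^2 - \sigma^2)^2,$$

meaning that [16.1.22] itself converges to zero:

$$(1/T)\sum_{t=1}^{T}[(t/T)\varepsilon_t]^2 - (1/T)\sum_{t=1}^{T}\sigma_t^2 \xrightarrow{m.s.} 0.$$

But this implies that

$$(1/T)\sum_{t=1}^{T}[(t/T)\varepsilon_t]^2 \xrightarrow{p} \sigma^2/3,$$

as claimed. Hence, from Proposition 7.8, $(1/\sqrt{T})\sum_{t=1}^{T}(t/T)\varepsilon_t$ satisfies the central
limit theorem:

$$(1/\sqrt{T})\sum_{t=1}^{T}(t/T)\varepsilon_t \xrightarrow{L} N(0, \sigma^2/3).$$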
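Finally, consider the joint distribution of the two elements in the (2 × 1)
vector described by [16.1.21]. Any linear combination of these elements takes the
form

$$(1/\sqrt{T})\sum_{t=1}^{T}[\lambda_1 + \lambda_2(t/T)]\varepsilon_t.$$

Then $[\lambda_1 + \lambda_2(t/T)]\varepsilon_t$ is also a martingale difference sequence with positive variance⁵
given by $\sigma^2[\lambda_1^2 + 2\lambda_1\lambda_2(t/T) + \lambda_2^2(t/T)^2]$ satisfying

$$\begin{aligned}
(1/T)\sum_{t=1}^{T}\sigma^2[\lambda_1^2 + 2\lambda_1\lambda_2(t/T) + \lambda_2^2(t/T)^2]
&\to \sigma^2[\lambda_1^2 + 2\lambda_1\lambda_2(\tfrac{1}{2}) + \lambda_2^2(\tfrac{1}{3})] \\
&= \sigma^2\boldsymbol{\lambda}'\mathbf{Q}\boldsymbol{\lambda}
\end{aligned}$$

for $\boldsymbol{\lambda} \equiv (\lambda_1, \lambda_2)'$ and $\mathbf{Q}$ the matrix in [16.1.20]. Furthermore,

$$(1/T)\sum_{t=1}^{T}[\lambda_1 + \lambda_2(t/T)]^2\varepsilon_t^2 \xrightarrow{p} \sigma^2\boldsymbol{\lambda}'\mathbf{Q}\boldsymbol{\lambda};$$  [16.1.23]

see Exercise 16.1. Thus any linear combination of the two elements in the vector
in [16.1.21] is asymptotically Gaussian, implying a limiting bivariate Gaussian
distribution:

$$\begin{bmatrix} (1/\sqrt{T})\Sigma\varepsilon_t \\ (1/\sqrt{T})\Sigma(t/T)\varepsilon_t \end{bmatrix} \xrightarrow{L} N(\mathbf{0}, \sigma^2\mathbf{Q}).$$  [16.1.24]

⁵More accurately, a given nonzero $\lambda_1$ and $\lambda_2$ will produce a zero variance for $[\lambda_1 + \lambda_2(t/T)]\varepsilon_t$ for
at most a single value of t, which does not affect the validity of the asymptotic claim.

From [16.1.19] and [16.1.24], the asymptotic distribution of [16.1.18] can be
calculated as in Example 7.5 of Chapter 7:

$$\begin{bmatrix} \sqrt{T}(\hat{\alpha}_T - \alpha) \\ T^{3/2}(\hat{\delta}_T - \delta) \end{bmatrix}
\xrightarrow{L} N(\mathbf{0}, [\mathbf{Q}^{-1}\cdot\sigma^2\mathbf{Q}\cdot\mathbf{Q}^{-1}]) = N(\mathbf{0}, \sigma^2\mathbf{Q}^{-1}).$$  [16.1.25]

These results can be summarized as follows.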


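Proposition 16.1: Let $y_t$ be generated according to the simple deterministic time
trend [16.1.1] where $\varepsilon_t$ is i.i.d. with $E(\varepsilon_t^2) = \sigma^2$ and $E(\varepsilon_t^4) < \infty$. Then

$$\begin{bmatrix} \sqrt{T}(\hat{\alpha}_T - \alpha) \\ T^{3/2}(\hat{\delta}_T - \delta) \end{bmatrix}
\xrightarrow{L} N\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix},\;
\sigma^2\begin{bmatrix} 1 & \tfrac{1}{2} \\ \tfrac{1}{2} & \tfrac{1}{3} \end{bmatrix}^{-1}\right).$$  [16.1.26]

Note that the resulting estimate of the coefficient on the time trend ($\hat{\delta}_T$) is
superconsistent: not only does $\hat{\delta}_T \xrightarrow{p} \delta$, but even when multiplied by T, we still
have

$$T(\hat{\delta}_T - \delta) \xrightarrow{p} 0;$$  [16.1.27]

see Exercise 16.2.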
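Proposition 16.1 can be explored with a small Monte Carlo experiment. The Python sketch below is an illustration under stated assumptions rather than part of the formal argument: the function names, the parameter values ($\alpha = 1$, $\delta = 0.5$), the sample sizes, and the choice of uniform innovations with unit variance are all hypothetical. It computes OLS estimates from the closed-form normal equations, rescales them by $\sqrt{T}$ and $T^{3/2}$, and reports the sample covariance matrix, which can be compared with $\sigma^2\mathbf{Q}^{-1}$, a matrix with rows (4, −6) and (−6, 12) when $\sigma^2 = 1$.

```python
import math
import random


def ols_time_trend(y):
    """OLS estimates (alpha_hat, delta_hat) from regressing y_t on (1, t), t = 1..T."""
    T = len(y)
    s_t = T * (T + 1) / 2.0                 # sum of t, from [16.1.9]
    s_tt = T * (T + 1) * (2 * T + 1) / 6.0  # sum of t^2, from [16.1.10]
    s_y = sum(y)
    s_ty = sum(t * y_t for t, y_t in enumerate(y, start=1))
    delta_hat = (T * s_ty - s_t * s_y) / (T * s_tt - s_t ** 2)
    alpha_hat = (s_y - delta_hat * s_t) / T
    return alpha_hat, delta_hat


def simulate_scaled_errors(T, reps, alpha=1.0, delta=0.5, seed=0):
    """Draws of (sqrt(T)(alpha_hat - alpha), T^(3/2)(delta_hat - delta))."""
    rng = random.Random(seed)
    half_width = math.sqrt(3.0)  # uniform on (-sqrt 3, sqrt 3) has variance 1
    draws = []
    for _ in range(reps):
        y = [alpha + delta * t + rng.uniform(-half_width, half_width)
             for t in range(1, T + 1)]
        a_hat, d_hat = ols_time_trend(y)
        draws.append((math.sqrt(T) * (a_hat - alpha), T ** 1.5 * (d_hat - delta)))
    return draws


def sample_cov(draws):
    """2 x 2 sample covariance matrix of a list of pairs."""
    n = len(draws)
    m0 = sum(d[0] for d in draws) / n
    m1 = sum(d[1] for d in draws) / n
    c00 = sum((d[0] - m0) ** 2 for d in draws) / n
    c01 = sum((d[0] - m0) * (d[1] - m1) for d in draws) / n
    c11 = sum((d[1] - m1) ** 2 for d in draws) / n
    return [[c00, c01], [c01, c11]]


if __name__ == "__main__":
    print(sample_cov(simulate_scaled_errors(T=100, reps=2000)))
```

The innovations here are deliberately non-Gaussian, so any agreement with $\sigma^2\mathbf{Q}^{-1}$ reflects the asymptotic result rather than exact small-sample normality.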
Different rates of convergence are sometimes described in terms of order in probability. A sequence of random variables $\{X_T\}_{T=1}^{\infty}$ is said to be $O_p(T^{-1/2})$ if for every $\varepsilon > 0$, there exists an $M > 0$ such that

$$P\{|X_T| > M/\sqrt{T}\} < \varepsilon$$  [16.1.28]

for all $T$; in other words, the random variable $\sqrt{T} \cdot X_T$ is almost certain to fall within $\pm M$ for any $T$. Most of the estimators encountered for stationary time series are $O_p(T^{-1/2})$. For example, suppose that $X_T$ represents the mean of a sample of size $T$,

$$X_T = (1/T) \sum_{t=1}^{T} y_t,$$

where $\{y_t\}$ is i.i.d. with mean zero and variance $\sigma^2$. Then the variance of $X_T$ is $\sigma^2/T$. But Chebyshev's inequality implies that

$$P\{|X_T| > M/\sqrt{T}\} \le \frac{\sigma^2/T}{M^2/T} = (\sigma/M)^2$$

for any $M$. By choosing $M$ so that $(\sigma/M)^2 < \varepsilon$, condition [16.1.28] is guaranteed. Since the standard deviation of the estimator is $\sigma/\sqrt{T}$, by choosing $M$ to be a suitable multiple of $\sigma$, the band $X_T \pm M/\sqrt{T}$ can include as much of the density as desired.
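The Chebyshev argument above can be illustrated with a small simulation. This is a sketch under assumed values ($\sigma = 1.5$, $M = 4$, Gaussian draws); the variable names are hypothetical. It estimates $P\{|X_T| > M/\sqrt{T}\}$ for several sample sizes and compares each estimate with the bound $(\sigma/M)^2$, which does not depend on $T$.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, M, reps = 1.5, 4.0, 2000
bound = (sigma / M) ** 2                 # Chebyshev bound (sigma/M)^2, the same for every T

exceed = {}
for T in (10, 100, 1000):
    y = rng.normal(0.0, sigma, size=(reps, T))
    X_T = y.mean(axis=1)                 # sample mean for each replication
    # share of replications in which |X_T| falls outside the band M / sqrt(T)
    exceed[T] = np.mean(np.abs(X_T) > M / np.sqrt(T))

for T, share in exceed.items():
    print(f"T={T:5d}  estimated P(|X_T| > M/sqrt(T)) = {share:.4f}  bound = {bound:.4f}")
```

The estimated exceedance shares should stay below the bound at every `T`, which is the sense in which the sample mean is $O_p(T^{-1/2})$.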

As another example, the estimator $\hat{\alpha}_T$ in [16.1.26] would also be said to be $O_p(T^{-1/2})$. Since $\sqrt{T}$ times $(\hat{\alpha}_T - \alpha)$ is asymptotically Gaussian, there exists a band $\pm M/\sqrt{T}$ around $\hat{\alpha}_T$ that contains as much of the probability distribution as desired.

In general, a sequence of random variables $\{X_T\}_{T=1}^{\infty}$ is said to be $O_p(T^{-k})$ if for every $\varepsilon > 0$ there exists an $M > 0$ such that

$$P\{|X_T| > M/(T^k)\} < \varepsilon.$$  [16.1.29]

Thus, for example, the estimator $\hat{\delta}_T$ in [16.1.26] is $O_p(T^{-3/2})$, since there exists a band $\pm M$ around $T^{3/2}(\hat{\delta}_T - \delta)$ that contains as much of the probability distribution as desired.


460 Chapter 16 | Processes with Deterministic Time Trends 


16.2. Hypothesis Testing for the Simple Time 
Trend Model 


If the innovations $\varepsilon_t$ for the simple time trend [16.1.1] are Gaussian, then the OLS estimates $\hat{\alpha}_T$ and $\hat{\delta}_T$ are Gaussian and the usual OLS $t$ and $F$ tests have exact small-sample $t$ and $F$ distributions for all sample sizes $T$. Thus, despite the fact that $\hat{\alpha}_T$ and $\hat{\delta}_T$ have different asymptotic rates of convergence, the standard errors $\hat{\sigma}_{\hat{\alpha}_T}$ and $\hat{\sigma}_{\hat{\delta}_T}$ evidently have offsetting asymptotic behavior so that the statistics such as $(\hat{\delta}_T - \delta_0)/\hat{\sigma}_{\hat{\delta}_T}$ are asymptotically $N(0, 1)$ when the innovations are Gaussian. We might thus conjecture that the usual $t$ and $F$ tests are asymptotically valid for non-Gaussian innovations as well. This conjecture is indeed correct, as we now verify.

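First consider the OLS $t$ test of the null hypothesis $\alpha = \alpha_0$, which can be written as

$$t_T = \frac{\hat{\alpha}_T - \alpha_0}{\left\{ s_T^2 \begin{bmatrix} 1 & 0 \end{bmatrix} (X_T'X_T)^{-1} \begin{bmatrix} 1 \\ 0 \end{bmatrix} \right\}^{1/2}}.$$  [16.2.1]

Here $s_T^2$ denotes the OLS estimate of $\sigma^2$:

$$s_T^2 = [1/(T - 2)] \sum_{t=1}^{T} (y_t - \hat{\alpha}_T - \hat{\delta}_T t)^2,$$  [16.2.2]

and $(X_T'X_T) = \sum_{t=1}^{T} x_t x_t'$ denotes the matrix in equation [16.1.16]. The numerator and denominator of [16.2.1] can further be multiplied by $\sqrt{T}$, resulting in

$$t_T = \frac{\sqrt{T}(\hat{\alpha}_T - \alpha_0)}{\left\{ s_T^2 \begin{bmatrix} \sqrt{T} & 0 \end{bmatrix} (X_T'X_T)^{-1} \begin{bmatrix} \sqrt{T} \\ 0 \end{bmatrix} \right\}^{1/2}}.$$  [16.2.3]

Note further from [16.1.17] that

$$\begin{bmatrix} \sqrt{T} & 0 \end{bmatrix} = \begin{bmatrix} 1 & 0 \end{bmatrix} \Upsilon_T.$$  [16.2.4]

Substituting [16.2.4] into [16.2.3],

$$t_T = \frac{\sqrt{T}(\hat{\alpha}_T - \alpha_0)}{\left\{ s_T^2 \begin{bmatrix} 1 & 0 \end{bmatrix} \Upsilon_T (X_T'X_T)^{-1} \Upsilon_T \begin{bmatrix} 1 \\ 0 \end{bmatrix} \right\}^{1/2}}.$$  [16.2.5]

But recall from [16.1.19] that

$$\Upsilon_T (X_T'X_T)^{-1} \Upsilon_T = [\Upsilon_T^{-1} (X_T'X_T) \Upsilon_T^{-1}]^{-1} \to Q^{-1}.$$  [16.2.6]

It is straightforward to show that $s_T^2 \xrightarrow{p} \sigma^2$. Recall further that $\sqrt{T}(\hat{\alpha}_T - \alpha_0) \xrightarrow{L} N(0, \sigma^2 q^{11})$ for $q^{11}$ the (1, 1) element of $Q^{-1}$. Hence, from [16.2.5],

$$t_T \xrightarrow{p} \frac{\sqrt{T}(\hat{\alpha}_T - \alpha_0)}{\left\{ \sigma^2 \begin{bmatrix} 1 & 0 \end{bmatrix} Q^{-1} \begin{bmatrix} 1 \\ 0 \end{bmatrix} \right\}^{1/2}} = \frac{\sqrt{T}(\hat{\alpha}_T - \alpha_0)}{\sigma \sqrt{q^{11}}}.$$  [16.2.7]

But this is an asymptotically Gaussian variable divided by the square root of its variance, and so asymptotically it has a $N(0, 1)$ distribution. Thus, the usual OLS $t$ test of $\alpha = \alpha_0$ will give an asymptotically valid inference.

Similarly, consider the usual OLS $t$ test of $\delta = \delta_0$:

$$t_T = \frac{\hat{\delta}_T - \delta_0}{\left\{ s_T^2 \begin{bmatrix} 0 & 1 \end{bmatrix} (X_T'X_T)^{-1} \begin{bmatrix} 0 \\ 1 \end{bmatrix} \right\}^{1/2}}.$$

Multiplying numerator and denominator by $T^{3/2}$,

$$t_T = \frac{T^{3/2}(\hat{\delta}_T - \delta_0)}{\left\{ s_T^2 \begin{bmatrix} 0 & T^{3/2} \end{bmatrix} (X_T'X_T)^{-1} \begin{bmatrix} 0 \\ T^{3/2} \end{bmatrix} \right\}^{1/2}} = \frac{T^{3/2}(\hat{\delta}_T - \delta_0)}{\left\{ s_T^2 \begin{bmatrix} 0 & 1 \end{bmatrix} \Upsilon_T (X_T'X_T)^{-1} \Upsilon_T \begin{bmatrix} 0 \\ 1 \end{bmatrix} \right\}^{1/2}} \xrightarrow{p} \frac{T^{3/2}(\hat{\delta}_T - \delta_0)}{\sigma \sqrt{q^{22}}},$$

which again is asymptotically a $N(0, 1)$ variable. Thus, although $\hat{\alpha}_T$ and $\hat{\delta}_T$ converge at different rates, the corresponding standard errors $\hat{\sigma}_{\hat{\alpha}_T}$ and $\hat{\sigma}_{\hat{\delta}_T}$ also incorporate different orders of $T$, with the result that the usual OLS $t$ tests are asymptotically valid.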
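A simulation can illustrate the claim that the usual OLS $t$ statistics remain approximately $N(0, 1)$ when the innovations are not Gaussian. The sketch below is a rough check rather than a proof; the centered exponential innovations, the sample size, and the helper name `trend_t_stats` are assumptions chosen for the example.

```python
import numpy as np

def trend_t_stats(T=200, reps=4000, alpha=1.0, delta=0.5, seed=2):
    """OLS t statistics for alpha and delta, evaluated at the true parameter values."""
    rng = np.random.default_rng(seed)
    t = np.arange(1, T + 1)
    X = np.column_stack([np.ones(T), t])
    XtX_inv = np.linalg.inv(X.T @ X)
    # skewed i.i.d. innovations with mean zero and unit variance
    eps = rng.exponential(1.0, size=(reps, T)) - 1.0
    Y = alpha + delta * t + eps
    B = Y @ X @ XtX_inv                       # row i holds b' = y'X (X'X)^{-1}
    resid = Y - B @ X.T
    s2 = (resid ** 2).sum(axis=1) / (T - 2)   # s_T^2 as in [16.2.2]
    se = np.sqrt(s2[:, None] * np.diag(XtX_inv)[None, :])
    return (B - np.array([alpha, delta])) / se

t_stats = trend_t_stats()
rejection = np.mean(np.abs(t_stats) > 1.96, axis=0)
print("share of |t| > 1.96 for (alpha, delta):", rejection)
```

Both rejection shares would be expected to lie near the nominal 5 percent, even though the two estimates converge at different rates.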
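It is interesting also to consider a test of a single hypothesis involving both $\alpha$ and $\delta$,

$$H_0: r_1 \alpha + r_2 \delta = r,$$

where $r_1$, $r_2$, and $r$ are parameters that describe the hypothesis. A $t$ test of $H_0$ can be obtained from the square root of the OLS $F$ test (expression [8.1.32]):²

$$t_T = \frac{r_1 \hat{\alpha}_T + r_2 \hat{\delta}_T - r}{\left\{ s_T^2 \begin{bmatrix} r_1 & r_2 \end{bmatrix} (X_T'X_T)^{-1} \begin{bmatrix} r_1 \\ r_2 \end{bmatrix} \right\}^{1/2}}.$$

In this case we multiply numerator and denominator by $\sqrt{T}$, the slower rate of convergence among the two estimators $\hat{\alpha}_T$ and $\hat{\delta}_T$:

$$\begin{aligned}
t_T &= \frac{\sqrt{T}(r_1 \hat{\alpha}_T + r_2 \hat{\delta}_T - r)}{\left\{ s_T^2 \sqrt{T} \begin{bmatrix} r_1 & r_2 \end{bmatrix} (X_T'X_T)^{-1} \begin{bmatrix} r_1 \\ r_2 \end{bmatrix} \sqrt{T} \right\}^{1/2}} \\
&= \frac{\sqrt{T}(r_1 \hat{\alpha}_T + r_2 \hat{\delta}_T - r)}{\left\{ s_T^2 \sqrt{T} \begin{bmatrix} r_1 & r_2 \end{bmatrix} \Upsilon_T^{-1} \Upsilon_T (X_T'X_T)^{-1} \Upsilon_T \Upsilon_T^{-1} \begin{bmatrix} r_1 \\ r_2 \end{bmatrix} \sqrt{T} \right\}^{1/2}} \\
&= \frac{\sqrt{T}(r_1 \hat{\alpha}_T + r_2 \hat{\delta}_T - r)}{\left\{ s_T^2 \, r_T' [\Upsilon_T (X_T'X_T)^{-1} \Upsilon_T] r_T \right\}^{1/2}},
\end{aligned}$$

where

$$r_T \equiv \sqrt{T} \, \Upsilon_T^{-1} \begin{bmatrix} r_1 \\ r_2 \end{bmatrix} = \begin{bmatrix} r_1 \\ r_2/T \end{bmatrix} \to \begin{bmatrix} r_1 \\ 0 \end{bmatrix}.$$  [16.2.8]

Similarly, recall from [16.1.27] that $\hat{\delta}_T$ is superconsistent, implying that

$$\sqrt{T}(r_1 \hat{\alpha}_T + r_2 \hat{\delta}_T - r) \xrightarrow{p} \sqrt{T}(r_1 \hat{\alpha}_T + r_2 \delta - r),$$  [16.2.9]

where $\delta$ is the true population value for the time trend parameter. Again applying [16.2.6], it follows that

$$t_T \xrightarrow{p} \frac{\sqrt{T}(r_1 \hat{\alpha}_T + r_2 \delta - r)}{\left\{ \sigma^2 \begin{bmatrix} r_1 & 0 \end{bmatrix} Q^{-1} \begin{bmatrix} r_1 \\ 0 \end{bmatrix} \right\}^{1/2}} = \frac{\sqrt{T}(r_1 \hat{\alpha}_T + r_2 \delta - r)}{\{ r_1^2 \sigma^2 q^{11} \}^{1/2}}.$$

But notice that

$$\sqrt{T}(r_1 \hat{\alpha}_T + r_2 \delta - r) = \sqrt{T}[r_1(\hat{\alpha}_T - \alpha) + r_1 \alpha + r_2 \delta - r] = \sqrt{T}[r_1(\hat{\alpha}_T - \alpha)]$$

under the null hypothesis. Hence, under the null,

$$t_T \xrightarrow{p} \frac{\sqrt{T}[r_1(\hat{\alpha}_T - \alpha)]}{\{ r_1^2 \sigma^2 q^{11} \}^{1/2}} = \frac{\sqrt{T}(\hat{\alpha}_T - \alpha)}{\{ \sigma^2 q^{11} \}^{1/2}},$$

which asymptotically has a $N(0, 1)$ distribution. Thus, again, the usual OLS $t$ test of $H_0$ is valid asymptotically.

²With a single linear restriction as here, $m = 1$ and expression [8.1.32] describes an $F(1, T - k)$ variable when the innovations are Gaussian. But an $F(1, T - k)$ variable is the square of a $t(T - k)$ variable. The test is described here in terms of a $t$ test rather than an $F$ test in order to facilitate comparison with the earlier results in this section.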
This last example illustrates the following general principle: A test involving a single restriction across parameters with different rates of convergence is dominated asymptotically by the parameters with the slowest rates of convergence. This means that a test involving both $\alpha$ and $\delta$ that employs the estimated value of $\delta$ would have the same asymptotic properties under the null as a test that employs the true value of $\delta$.
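Finally, consider a joint test of separate hypotheses about $\alpha$ and $\delta$,

$$H_0: \begin{bmatrix} \alpha \\ \delta \end{bmatrix} = \begin{bmatrix} \alpha_0 \\ \delta_0 \end{bmatrix},$$  [16.2.10]

or, in vector form,

$$\beta = \beta_0.$$

The Wald form of the OLS $\chi^2$ test of $H_0$ is found from [8.2.23] by taking $R = I_2$:

$$\begin{aligned}
\chi_T^2 &= (b_T - \beta_0)' [s_T^2 (X_T'X_T)^{-1}]^{-1} (b_T - \beta_0) \\
&= (b_T - \beta_0)' \Upsilon_T [\Upsilon_T s_T^2 (X_T'X_T)^{-1} \Upsilon_T]^{-1} \Upsilon_T (b_T - \beta_0) \\
&\xrightarrow{p} [\Upsilon_T (b_T - \beta_0)]' [\sigma^2 Q^{-1}]^{-1} [\Upsilon_T (b_T - \beta_0)].
\end{aligned}$$

Recalling [16.1.25], this is a quadratic form in a two-dimensional Gaussian vector of the sort considered in Proposition 8.1, from which

$$\chi_T^2 \xrightarrow{L} \chi^2(2).$$

Thus, again, the usual OLS test is asymptotically valid.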


16.3. Asymptotic Inference for an Autoregressive Process 
Around a Deterministic Time Trend 


The same principles can be used to study a general autoregressive process around a deterministic time trend:

$$y_t = \alpha + \delta t + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t.$$  [16.3.1]

It is assumed throughout this section that $\varepsilon_t$ is i.i.d. with mean zero, variance $\sigma^2$, and finite fourth moment, and that roots of

$$1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_p z^p = 0$$

lie outside the unit circle. Consider a sample of $T + p$ observations on $y$, $\{y_{-p+1}, y_{-p+2}, \ldots, y_T\}$, and let $\hat{\alpha}_T, \hat{\delta}_T, \hat{\phi}_{1,T}, \ldots, \hat{\phi}_{p,T}$ denote coefficient estimates based on ordinary least squares estimation of [16.3.1] for $t = 1, 2, \ldots, T$.


A Useful Transformation of the Regressors 
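By adding and subtracting $\phi_j[\alpha + \delta(t - j)]$ for $j = 1, 2, \ldots, p$ on the right side, the regression model [16.3.1] can equivalently be written as

$$\begin{aligned}
y_t = {} & \alpha(1 + \phi_1 + \phi_2 + \cdots + \phi_p) + \delta(1 + \phi_1 + \phi_2 + \cdots + \phi_p)t \\
& - \delta(\phi_1 + 2\phi_2 + \cdots + p\phi_p) + \phi_1[y_{t-1} - \alpha - \delta(t - 1)] \\
& + \phi_2[y_{t-2} - \alpha - \delta(t - 2)] + \cdots \\
& + \phi_p[y_{t-p} - \alpha - \delta(t - p)] + \varepsilon_t
\end{aligned}$$  [16.3.2]

or

$$y_t = \alpha^* + \delta^* t + \phi_1^* y_{t-1}^* + \phi_2^* y_{t-2}^* + \cdots + \phi_p^* y_{t-p}^* + \varepsilon_t,$$  [16.3.3]

where

$$\begin{aligned}
\alpha^* &= [\alpha(1 + \phi_1 + \phi_2 + \cdots + \phi_p) - \delta(\phi_1 + 2\phi_2 + \cdots + p\phi_p)] \\
\delta^* &= \delta(1 + \phi_1 + \phi_2 + \cdots + \phi_p) \\
\phi_j^* &= \phi_j \quad \text{for } j = 1, 2, \ldots, p
\end{aligned}$$

and

$$y_{t-j}^* = y_{t-j} - \alpha - \delta(t - j) \quad \text{for } j = 1, 2, \ldots, p.$$  [16.3.4]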


The idea of transforming the regression into a form such as [16.3.3] is due to Sims, Stock, and Watson (1990).³ The objective is to rewrite the regressors of [16.3.1] in terms of zero-mean covariance-stationary random variables (the terms $y_{t-j}^*$ for $j = 1, 2, \ldots, p$), a constant term, and a time trend. Transforming the regressors in this way isolates components of the OLS coefficient vector with different rates of convergence and provides a general technique for finding the asymptotic distribution of regressions involving nonstationary variables. A general result is that, if such a transformed equation were estimated by OLS, the coefficients on zero-mean covariance-stationary random variables (in this case, $\hat{\phi}_{1,T}^*, \hat{\phi}_{2,T}^*, \ldots, \hat{\phi}_{p,T}^*$) would converge at rate $\sqrt{T}$ to a Gaussian distribution. The coefficients $\hat{\alpha}_T^*$ and $\hat{\delta}_T^*$ from OLS estimation of [16.3.3] turn out to behave asymptotically exactly like $\hat{\alpha}_T$ and $\hat{\delta}_T$ for the simple time trend model analyzed in Section 16.1 and are asymptotically independent of the $\hat{\phi}^*$'s.

It is helpful to describe this transformation in more general notation that will also apply to more complicated models in the chapters that follow. The original regression model [16.3.1] can be written

$$y_t = x_t' \beta + \varepsilon_t,$$  [16.3.5]

³A simpler version of this theme appeared in the analysis of a univariate process with unit roots by Fuller (1976).




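where

$$\underset{(p+2) \times 1}{x_t} = \begin{bmatrix} y_{t-1} \\ y_{t-2} \\ \vdots \\ y_{t-p} \\ 1 \\ t \end{bmatrix}, \qquad \underset{(p+2) \times 1}{\beta} = \begin{bmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_p \\ \alpha \\ \delta \end{bmatrix}.$$  [16.3.6]

The algebraic transformation in arriving at [16.3.3] could then be described as rewriting [16.3.5] in the form

$$y_t = x_t' G' [G']^{-1} \beta + \varepsilon_t = [x_t^*]' \beta^* + \varepsilon_t,$$  [16.3.7]

where

$$\underset{(p+2) \times (p+2)}{G'} = \begin{bmatrix}
1 & 0 & \cdots & 0 & 0 & 0 \\
0 & 1 & \cdots & 0 & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\
0 & 0 & \cdots & 1 & 0 & 0 \\
-\alpha + \delta & -\alpha + 2\delta & \cdots & -\alpha + p\delta & 1 & 0 \\
-\delta & -\delta & \cdots & -\delta & 0 & 1
\end{bmatrix}$$  [16.3.8]

$$[G']^{-1} = \begin{bmatrix}
1 & 0 & \cdots & 0 & 0 & 0 \\
0 & 1 & \cdots & 0 & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\
0 & 0 & \cdots & 1 & 0 & 0 \\
\alpha - \delta & \alpha - 2\delta & \cdots & \alpha - p\delta & 1 & 0 \\
\delta & \delta & \cdots & \delta & 0 & 1
\end{bmatrix}$$

$$x_t^* = G x_t = \begin{bmatrix} y_{t-1}^* \\ y_{t-2}^* \\ \vdots \\ y_{t-p}^* \\ 1 \\ t \end{bmatrix}$$  [16.3.9]

$$\beta^* = [G']^{-1} \beta = \begin{bmatrix} \phi_1^* \\ \phi_2^* \\ \vdots \\ \phi_p^* \\ \alpha^* \\ \delta^* \end{bmatrix}.$$  [16.3.10]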


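The system of [16.3.7] is just an algebraically equivalent representation of the regression model [16.3.5]. Notice that the estimate of $\beta^*$ based on an OLS regression of $y_t$ on $x_t^*$ is given by

$$\begin{aligned}
b_T^* &= \left[ \sum_{t=1}^{T} x_t^* [x_t^*]' \right]^{-1} \left[ \sum_{t=1}^{T} x_t^* y_t \right] \\
&= \left[ G \left( \sum_{t=1}^{T} x_t x_t' \right) G' \right]^{-1} G \left( \sum_{t=1}^{T} x_t y_t \right) \\
&= [G']^{-1} \left( \sum_{t=1}^{T} x_t x_t' \right)^{-1} G^{-1} G \left( \sum_{t=1}^{T} x_t y_t \right) \\
&= [G']^{-1} \left( \sum_{t=1}^{T} x_t x_t' \right)^{-1} \left( \sum_{t=1}^{T} x_t y_t \right) \\
&= [G']^{-1} b_T,
\end{aligned}$$  [16.3.11]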




where $b_T$ denotes the estimated coefficient vector from an OLS regression of $y_t$ on $x_t$. Thus, the coefficient estimate for the transformed regression ($b_T^*$) is a simple linear transformation of the coefficient estimate for the original system ($b_T$). The fitted value for date $t$ associated with the transformed regression is

$$[x_t^*]' b_T^* = [G x_t]' [G']^{-1} b_T = x_t' b_T.$$

Thus, the fitted values for the transformed regression are numerically identical to the fitted values from the original regression.
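The algebra behind [16.3.11] and the equality of fitted values can be verified numerically. The following sketch assumes $p = 2$ and arbitrary illustrative parameter values; the variable names are hypothetical. It builds $G$ from the true $\alpha$ and $\delta$ (which is possible only in a simulation, where those values are known), runs both regressions, and compares the results.

```python
import numpy as np

rng = np.random.default_rng(3)
T, p = 300, 2
alpha, delta, phi = 1.0, 0.2, np.array([0.5, -0.3])

# Simulate y_t = alpha + delta*t + phi_1*y_{t-1} + phi_2*y_{t-2} + eps_t
y = np.zeros(T + p)                       # first p entries act as presample values
for i in range(p, T + p):
    t = i - p + 1
    y[i] = alpha + delta * t + phi @ y[i - p:i][::-1] + rng.normal()

# Original regressors x_t = (y_{t-1}, ..., y_{t-p}, 1, t)'
lags = np.column_stack([y[p - j:T + p - j] for j in range(1, p + 1)])
X = np.column_stack([lags, np.ones(T), np.arange(1, T + 1)])
y_dep = y[p:]

# G maps x_t into x_t* = (y*_{t-1}, ..., y*_{t-p}, 1, t)', as in [16.3.9]
G = np.eye(p + 2)
for j in range(1, p + 1):
    G[j - 1, p] = -alpha + j * delta      # coefficient on the constant
    G[j - 1, p + 1] = -delta              # coefficient on t
X_star = X @ G.T                          # row t equals (G x_t)'

b, *_ = np.linalg.lstsq(X, y_dep, rcond=None)
b_star, *_ = np.linalg.lstsq(X_star, y_dep, rcond=None)

print("b* from transformed regression:", b_star)
print("[G']^{-1} b                   :", np.linalg.solve(G.T, b))
print("max gap in fitted values      :", np.max(np.abs(X @ b - X_star @ b_star)))
```

The two coefficient vectors should coincide up to floating-point error, and the fitted values should be identical in the same sense.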

Of course, given data only on $\{y_t\}$, we could not actually estimate the transformed regression by OLS, because construction of $x_t^*$ from $x_t$ requires knowledge of the true values of the parameters $\alpha$ and $\delta$. It is nevertheless helpful to summarize the properties of hypothetical OLS estimation of [16.3.7], because [16.3.7] is easier to analyze than [16.3.5]. Moreover, once we find the asymptotic distribution of $b_T^*$, the asymptotic distribution of $b_T$ can be inferred by inverting [16.3.11]:

$$b_T = G' b_T^*.$$  [16.3.12]


The Asymptotic Distribution of OLS Estimates 
for the Transformed Regression 


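Appendix 16.A to this chapter demonstrates that

$$\Upsilon_T (b_T^* - \beta^*) \xrightarrow{L} N(0, \sigma^2 [Q^*]^{-1}),$$  [16.3.13]

where

$$\underset{(p+2) \times (p+2)}{\Upsilon_T} = \begin{bmatrix}
\sqrt{T} & 0 & \cdots & 0 & 0 & 0 \\
0 & \sqrt{T} & \cdots & 0 & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\
0 & 0 & \cdots & \sqrt{T} & 0 & 0 \\
0 & 0 & \cdots & 0 & \sqrt{T} & 0 \\
0 & 0 & \cdots & 0 & 0 & T^{3/2}
\end{bmatrix}$$  [16.3.14]

$$\underset{(p+2) \times (p+2)}{Q^*} = \begin{bmatrix}
\gamma_0^* & \gamma_1^* & \gamma_2^* & \cdots & \gamma_{p-1}^* & 0 & 0 \\
\gamma_1^* & \gamma_0^* & \gamma_1^* & \cdots & \gamma_{p-2}^* & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\
\gamma_{p-1}^* & \gamma_{p-2}^* & \gamma_{p-3}^* & \cdots & \gamma_0^* & 0 & 0 \\
0 & 0 & 0 & \cdots & 0 & 1 & \tfrac{1}{2} \\
0 & 0 & 0 & \cdots & 0 & \tfrac{1}{2} & \tfrac{1}{3}
\end{bmatrix}$$  [16.3.15]

for $\gamma_j^* = E(y_t^* y_{t-j}^*)$. In other words, the OLS estimate $b_T^*$ is asymptotically Gaussian, with the coefficient on the time trend ($\hat{\delta}_T^*$) converging at rate $T^{3/2}$ and all other coefficients converging at rate $\sqrt{T}$. The earlier result [16.1.26] is a special case of [16.3.13] with $p = 0$.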


The Asymptotic Distribution of OLS Estimates 
for the Original Regression 


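What does this result imply about the asymptotic distribution of $b_T$, the estimated coefficient vector for the OLS regression that is actually estimated? Writing out [16.3.12] explicitly using [16.3.8], we have

$$\begin{bmatrix} \hat{\phi}_{1,T} \\ \hat{\phi}_{2,T} \\ \vdots \\ \hat{\phi}_{p,T} \\ \hat{\alpha}_T \\ \hat{\delta}_T \end{bmatrix} = \begin{bmatrix}
1 & 0 & \cdots & 0 & 0 & 0 \\
0 & 1 & \cdots & 0 & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\
0 & 0 & \cdots & 1 & 0 & 0 \\
-\alpha + \delta & -\alpha + 2\delta & \cdots & -\alpha + p\delta & 1 & 0 \\
-\delta & -\delta & \cdots & -\delta & 0 & 1
\end{bmatrix} \begin{bmatrix} \hat{\phi}_{1,T}^* \\ \hat{\phi}_{2,T}^* \\ \vdots \\ \hat{\phi}_{p,T}^* \\ \hat{\alpha}_T^* \\ \hat{\delta}_T^* \end{bmatrix}.$$  [16.3.16]

The OLS estimates $\hat{\phi}_{j,T}$ of the untransformed regression are identical to the corresponding coefficients of the transformed regression $\hat{\phi}_{j,T}^*$, so the asymptotic distribution of $\hat{\phi}_{j,T}$ is given immediately by [16.3.13]. The estimate $\hat{\alpha}_T$ is a linear combination of variables that converge to a Gaussian distribution at rate $\sqrt{T}$, and so $\hat{\alpha}_T$ behaves the same way. Specifically, $\hat{\alpha}_T = g_\alpha' b_T^*$, where

$$g_\alpha' \equiv \begin{bmatrix} -\alpha + \delta & -\alpha + 2\delta & \cdots & -\alpha + p\delta & 1 & 0 \end{bmatrix},$$

and so, from [16.3.13],

$$\sqrt{T}(\hat{\alpha}_T - \alpha) \xrightarrow{L} N(0, \sigma^2 g_\alpha' [Q^*]^{-1} g_\alpha).$$  [16.3.17]

Finally, the estimate $\hat{\delta}_T$ is a linear combination of variables converging at different rates:

$$\hat{\delta}_T = g_\delta' b_T^* + \hat{\delta}_T^*,$$

where

$$g_\delta' \equiv \begin{bmatrix} -\delta & -\delta & \cdots & -\delta & 0 & 0 \end{bmatrix}.$$

Its asymptotic distribution is governed by the variables with the slowest rate of convergence:

$$\begin{aligned}
\sqrt{T}(\hat{\delta}_T - \delta) &= \sqrt{T}(\hat{\delta}_T^* + g_\delta' b_T^* - \delta^* - g_\delta' \beta^*) \\
&\xrightarrow{p} \sqrt{T}(\delta^* + g_\delta' b_T^* - \delta^* - g_\delta' \beta^*) \\
&= g_\delta' \sqrt{T}(b_T^* - \beta^*) \\
&\xrightarrow{L} N(0, \sigma^2 g_\delta' [Q^*]^{-1} g_\delta).
\end{aligned}$$

Thus, each of the elements of $b_T$ individually is asymptotically Gaussian and $O_p(T^{-1/2})$. The asymptotic distribution of the full vector $\sqrt{T}(b_T - \beta)$ is multivariate Gaussian, though with a singular variance-covariance matrix. Specifically, the particular linear combination of the elements of $b_T$ that recovers $\hat{\delta}_T^*$, the time trend coefficient of the hypothetical regression,

$$\hat{\delta}_T^* = -g_\delta' b_T^* + \hat{\delta}_T = \delta \hat{\phi}_{1,T} + \delta \hat{\phi}_{2,T} + \cdots + \delta \hat{\phi}_{p,T} + \hat{\delta}_T,$$

converges to a point mass around $\delta^*$ even when scaled by $\sqrt{T}$:

$$\sqrt{T}(\hat{\delta}_T^* - \delta^*) \xrightarrow{p} 0.$$

However, [16.3.13] establishes that

$$T^{3/2}(\hat{\delta}_T^* - \delta^*) \xrightarrow{L} N(0, \sigma^2 (q^*)^{p+2,p+2})$$

for $(q^*)^{p+2,p+2}$ the bottom right element of $[Q^*]^{-1}$.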


Hypothesis Tests 


The preceding analysis described the asymptotic distribution of $b_T$ in terms of the properties of the transformed regression estimates $b_T^*$. This might seem to imply that knowledge of the transformation matrix $G$ in [16.3.8] is necessary in order to conduct hypothesis tests. Fortunately, this is not the case. The results of Section 16.2 turn out to apply equally well to the general model [16.3.1]: the usual $t$ and $F$ tests about $\beta$ calculated in the usual way on the untransformed system are all asymptotically valid.

Consider the following null hypothesis about the parameters of the untransformed system:

$$H_0: R\beta = r.$$  [16.3.18]

Here $R$ is a known $[m \times (p + 2)]$ matrix, $r$ is a known $(m \times 1)$ vector, and $m$ is the number of restrictions. The Wald form of the OLS $\chi^2$ test of $H_0$ (expression [8.2.23]) is

$$\chi_T^2 = (R b_T - r)' \left[ s_T^2 R \left( \sum_{t=1}^{T} x_t x_t' \right)^{-1} R' \right]^{-1} (R b_T - r).$$  [16.3.19]

Here $b_T$ is the OLS estimate of $\beta$ based on observation of $\{y_{-p+1}, y_{-p+2}, \ldots, y_0, y_1, \ldots, y_T\}$ and $s_T^2 = [1/(T - p - 2)] \sum_{t=1}^{T} (y_t - x_t' b_T)^2$.

Under the null hypothesis [16.3.18], expression [16.3.19] can be rewritten

$$\begin{aligned}
\chi_T^2 &= [R(b_T - \beta)]' \left[ s_T^2 R \left( \sum_{t=1}^{T} x_t x_t' \right)^{-1} R' \right]^{-1} [R(b_T - \beta)] \\
&= [R G' (G')^{-1} (b_T - \beta)]' \\
&\quad \times \left[ s_T^2 R G' (G')^{-1} \left( \sum_{t=1}^{T} x_t x_t' \right)^{-1} G^{-1} G R' \right]^{-1} [R G' (G')^{-1} (b_T - \beta)].
\end{aligned}$$  [16.3.20]

Notice that

$$(G')^{-1} \left( \sum_{t=1}^{T} x_t x_t' \right)^{-1} G^{-1} = \left[ G \left( \sum_{t=1}^{T} x_t x_t' \right) G' \right]^{-1} = \left( \sum_{t=1}^{T} x_t^* [x_t^*]' \right)^{-1}$$

for $x_t^*$ given by [16.3.9]. Similarly, from [16.3.10] and [16.3.11],

$$(b_T^* - \beta^*) = (G')^{-1} (b_T - \beta).$$

Defining

$$R^* \equiv R G',$$

expression [16.3.20] can be written

$$\chi_T^2 = [R^*(b_T^* - \beta^*)]' \left[ s_T^2 R^* \left( \sum_{t=1}^{T} x_t^* [x_t^*]' \right)^{-1} [R^*]' \right]^{-1} [R^*(b_T^* - \beta^*)].$$  [16.3.21]

Expression [16.3.21] will be recognized as the $\chi^2$ test that would be calculated if we had estimated the transformed system and wanted to test the hypothesis that $R^* \beta^* = r$ (recall that the fitted values for the transformed and untransformed regressions are identical, so that $s_T^2$ will be the same value for either representation). Observe that the transformed regression does not actually have to be estimated in order to calculate this statistic, since [16.3.21] is numerically identical to the $\chi^2$ statistic [16.3.20] that is calculated from the untransformed system in the usual way. Nevertheless, expression [16.3.21] gives us another way of thinking about the distribution of the statistic as actually calculated in [16.3.20].




Expression [16.3.21] can be further rewritten as

$$\begin{aligned}
\chi_T^2 = {} & [R^* \Upsilon_T^{-1} \Upsilon_T (b_T^* - \beta^*)]' \\
& \times \left[ s_T^2 R^* \Upsilon_T^{-1} \Upsilon_T \left( \sum_{t=1}^{T} x_t^* [x_t^*]' \right)^{-1} \Upsilon_T \Upsilon_T^{-1} [R^*]' \right]^{-1} \\
& \times [R^* \Upsilon_T^{-1} \Upsilon_T (b_T^* - \beta^*)]
\end{aligned}$$  [16.3.22]

for $\Upsilon_T$ the matrix in [16.3.14]. Recall the insight from Section 16.2 that hypothesis tests involving coefficients with different rates of convergence will be dominated by the variables with the slowest rate of convergence. This means that some of the elements of $R^*$ may be irrelevant asymptotically, so that [16.3.22] has the same asymptotic distribution as a simpler expression. To describe this expression, consider two possibilities.


Case 1. Each of the m Hypotheses Represented by $R^* \beta^* = r$ Involves a Parameter that Converges at Rate $\sqrt{T}$


Of course, we could trivially rewrite any system of restrictions so as to involve $O_p(T^{-1/2})$ parameters in every equation. For example, the null hypothesis

$$H_0: \phi_2^* = 0, \; \delta^* = 0$$  [16.3.23]

could be rewritten as

$$H_0: \phi_2^* = 0, \; \delta^* = \phi_2^*,$$  [16.3.24]

which seems to include $\phi_2^*$ in each restriction. For purposes of implementing a test of $H_0$, it does not matter which representation of $H_0$ is used, since either will produce the identical value for the test statistic.⁴ For purposes of analyzing the properties of the test, we distinguish a hypothesis such as [16.3.23] from a hypothesis involving only $\phi_2^*$ and $\phi_3^*$. For this distinction to be meaningful, we will assume that $H_0$ would be written in the form of [16.3.23] rather than [16.3.24].


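⁴More generally, let $H$ be any nonsingular $(m \times m)$ matrix. Then the null hypothesis $R\beta = r$ can equivalently be written as $\dot{R}\beta = \dot{r}$, where $\dot{R} \equiv HR$ and $\dot{r} \equiv Hr$. The $\chi^2$ statistic constructed from the second parameterization is

$$\begin{aligned}
\chi^2 &= (\dot{R} b - \dot{r})' \left[ s_T^2 \dot{R} \left( \sum_{t=1}^{T} x_t x_t' \right)^{-1} \dot{R}' \right]^{-1} (\dot{R} b - \dot{r}) \\
&= (R b - r)' H' [H']^{-1} \left[ s_T^2 R \left( \sum_{t=1}^{T} x_t x_t' \right)^{-1} R' \right]^{-1} H^{-1} H (R b - r) \\
&= (R b - r)' \left[ s_T^2 R \left( \sum_{t=1}^{T} x_t x_t' \right)^{-1} R' \right]^{-1} (R b - r),
\end{aligned}$$

which is identical to the $\chi^2$ statistic constructed from the first parameterization. The representation [16.3.24] is an example of such a transformation of [16.3.23], with

$$H = \begin{bmatrix} 1 & 0 \\ -1 & 1 \end{bmatrix}.$$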




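In general terms, this means that $R^*$ is "upper triangular."⁵ "Case 1" describes the situation in which the first $p + 1$ elements of the last row of $R^*$ are not all zero.

For case 1, even though some of the hypotheses may involve $\delta^*$, a test of the null hypothesis will be asymptotically equivalent to a test that treated $\delta^*$ as if known with certainty. This is a consequence of $\hat{\delta}_T^*$ being superconsistent. To develop this result rigorously, notice that

$$R^* \Upsilon_T^{-1} = \begin{bmatrix}
r_{11}^*/\sqrt{T} & r_{12}^*/\sqrt{T} & \cdots & r_{1,p+1}^*/\sqrt{T} & r_{1,p+2}^*/T^{3/2} \\
r_{21}^*/\sqrt{T} & r_{22}^*/\sqrt{T} & \cdots & r_{2,p+1}^*/\sqrt{T} & r_{2,p+2}^*/T^{3/2} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
r_{m1}^*/\sqrt{T} & r_{m2}^*/\sqrt{T} & \cdots & r_{m,p+1}^*/\sqrt{T} & r_{m,p+2}^*/T^{3/2}
\end{bmatrix},$$

and define

$$\underset{(m \times m)}{\tilde{\Upsilon}_T} \equiv \sqrt{T} \, I_m,$$

$$\tilde{R}_T^* \equiv \begin{bmatrix}
r_{11}^* & r_{12}^* & \cdots & r_{1,p+1}^* & r_{1,p+2}^*/T \\
r_{21}^* & r_{22}^* & \cdots & r_{2,p+1}^* & r_{2,p+2}^*/T \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
r_{m1}^* & r_{m2}^* & \cdots & r_{m,p+1}^* & r_{m,p+2}^*/T
\end{bmatrix}.$$

These matrices were chosen so that

$$R^* \Upsilon_T^{-1} = \tilde{\Upsilon}_T^{-1} \tilde{R}_T^*.$$  [16.3.25]

The matrix $\tilde{R}_T^*$ has the further property that

$$\tilde{R}_T^* \to \tilde{R}^*,$$  [16.3.26]

where $\tilde{R}^*$ involves only those restrictions that affect the asymptotic distribution:

$$\tilde{R}^* = \begin{bmatrix}
r_{11}^* & r_{12}^* & \cdots & r_{1,p+1}^* & 0 \\
r_{21}^* & r_{22}^* & \cdots & r_{2,p+1}^* & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
r_{m1}^* & r_{m2}^* & \cdots & r_{m,p+1}^* & 0
\end{bmatrix}.$$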

⁵"Upper triangular" means that if the set of restrictions in $H_0$ involves parameters $\beta_{i_1}^*, \beta_{i_2}^*, \ldots, \beta_{i_n}^*$ with $i_1 < i_2 < \cdots < i_n$, then elements of $R^*$ in rows 2 through $m$ and columns 1 through $i_1$ are all zero. This is simply a normalization: any hypothesis $R^* \beta^* = r$ can be written in such a form by selecting a restriction involving $\beta_{i_1}^*$ to be the first row of $R^*$ and then multiplying the first row of this system of equations by a suitable constant and subtracting it from each of the following rows. If the system of restrictions represented by rows 2 through $m$ of the resulting matrix involves parameters $\beta_{j_1}^*, \beta_{j_2}^*, \ldots, \beta_{j_l}^*$ with $j_1 < j_2 < \cdots < j_l$, then it is assumed that the elements in rows 3 through $m$ and columns 1 through $j_1$ are all zero. An example of an upper triangular system is


O riy rh, O cs hin 
0 0 0 rhs, se hy, 0 
0 0 0 OQ: Tike Tks 




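Substituting [16.3.25] into [16.3.22],

$$\begin{aligned}
\chi_T^2 = {} & [\tilde{\Upsilon}_T^{-1} \tilde{R}_T^* \Upsilon_T (b_T^* - \beta^*)]' \\
& \times \left[ s_T^2 \tilde{\Upsilon}_T^{-1} \tilde{R}_T^* \Upsilon_T \left( \sum_{t=1}^{T} x_t^* [x_t^*]' \right)^{-1} \Upsilon_T [\tilde{R}_T^*]' \tilde{\Upsilon}_T^{-1} \right]^{-1} [\tilde{\Upsilon}_T^{-1} \tilde{R}_T^* \Upsilon_T (b_T^* - \beta^*)] \\
= {} & [\tilde{R}_T^* \Upsilon_T (b_T^* - \beta^*)]' \tilde{\Upsilon}_T^{-1} \\
& \times \tilde{\Upsilon}_T \left[ s_T^2 \tilde{R}_T^* \Upsilon_T \left( \sum_{t=1}^{T} x_t^* [x_t^*]' \right)^{-1} \Upsilon_T [\tilde{R}_T^*]' \right]^{-1} \tilde{\Upsilon}_T \tilde{\Upsilon}_T^{-1} [\tilde{R}_T^* \Upsilon_T (b_T^* - \beta^*)] \\
= {} & [\tilde{R}_T^* \Upsilon_T (b_T^* - \beta^*)]' \\
& \times \left[ s_T^2 \tilde{R}_T^* \Upsilon_T \left( \sum_{t=1}^{T} x_t^* [x_t^*]' \right)^{-1} \Upsilon_T [\tilde{R}_T^*]' \right]^{-1} [\tilde{R}_T^* \Upsilon_T (b_T^* - \beta^*)] \\
\xrightarrow{p} {} & [\tilde{R}^* \Upsilon_T (b_T^* - \beta^*)]' \left[ \sigma^2 \tilde{R}^* [Q^*]^{-1} [\tilde{R}^*]' \right]^{-1} [\tilde{R}^* \Upsilon_T (b_T^* - \beta^*)]
\end{aligned}$$  [16.3.27]

by virtue of [16.3.26] and [16.A.4].

Now [16.3.13] implies that

$$\tilde{R}^* \Upsilon_T (b_T^* - \beta^*) \xrightarrow{L} N(0, \tilde{R}^* \sigma^2 [Q^*]^{-1} [\tilde{R}^*]'),$$

and so [16.3.27] is a quadratic form in an asymptotically Gaussian variable of the kind covered in Proposition 8.1. It is therefore asymptotically $\chi^2(m)$. Since [16.3.27] is numerically identical to [16.3.19], the Wald form of the OLS $\chi^2$ test, calculated in the usual way from the untransformed regression [16.3.1], has the usual $\chi^2(m)$ distribution.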


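Case 2. One of the Hypotheses Involves Only the Time Trend Parameter $\delta^*$

Again assuming for purposes of discussion that $R^*$ is upper triangular, for case 2 the hypothesis about $\delta^*$ will be the sole entry in the $m$th row of $R^*$:

$$R^* = \begin{bmatrix}
r_{11}^* & r_{12}^* & \cdots & r_{1,p+1}^* & r_{1,p+2}^* \\
r_{21}^* & r_{22}^* & \cdots & r_{2,p+1}^* & r_{2,p+2}^* \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
r_{m-1,1}^* & r_{m-1,2}^* & \cdots & r_{m-1,p+1}^* & r_{m-1,p+2}^* \\
0 & 0 & \cdots & 0 & r_{m,p+2}^*
\end{bmatrix}.$$

For this case, define

$$\underset{(m \times m)}{\tilde{\Upsilon}_T} \equiv \begin{bmatrix}
\sqrt{T} & 0 & \cdots & 0 & 0 \\
0 & \sqrt{T} & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & \sqrt{T} & 0 \\
0 & 0 & \cdots & 0 & T^{3/2}
\end{bmatrix}$$

and

$$\tilde{R}_T^* \equiv \begin{bmatrix}
r_{11}^* & r_{12}^* & \cdots & r_{1,p+1}^* & r_{1,p+2}^*/T \\
r_{21}^* & r_{22}^* & \cdots & r_{2,p+1}^* & r_{2,p+2}^*/T \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
r_{m-1,1}^* & r_{m-1,2}^* & \cdots & r_{m-1,p+1}^* & r_{m-1,p+2}^*/T \\
0 & 0 & \cdots & 0 & r_{m,p+2}^*
\end{bmatrix}.$$

Notice that these matrices again satisfy [16.3.25] and [16.3.26] with

$$\tilde{R}^* = \begin{bmatrix}
r_{11}^* & r_{12}^* & \cdots & r_{1,p+1}^* & 0 \\
r_{21}^* & r_{22}^* & \cdots & r_{2,p+1}^* & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
r_{m-1,1}^* & r_{m-1,2}^* & \cdots & r_{m-1,p+1}^* & 0 \\
0 & 0 & \cdots & 0 & r_{m,p+2}^*
\end{bmatrix}.$$

The analysis of [16.3.27] thus goes through for this case as well with no change.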


Summary 


Any standard OLS $\chi^2$ test of the null hypothesis $R\beta = r$ for the regression model [16.3.1] can be calculated and interpreted in the usual way. The test is asymptotically valid for any hypothesis about any subset of the parameters in $\beta$. The elements of $R$ do not have to be ordered or expressed in any particular form for this to be true.
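As a rough illustration of this summary, the Wald statistic [16.3.19] can be computed directly from the untransformed regression. The sketch below assumes $p = 1$, Student-$t$ innovations with 5 degrees of freedom (so the fourth moment is finite), and a hypothetical helper `wald_stat`; none of these choices comes from the text. It also checks the invariance to premultiplying the restrictions by a nonsingular matrix $H$, using the same $H$ that maps [16.3.23] into [16.3.24].

```python
import numpy as np

def wald_stat(X, y, R, r):
    """Wald form of the OLS chi-square test of R beta = r, following [16.3.19]."""
    T, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    s2 = np.sum((y - X @ b) ** 2) / (T - k)      # k = p + 2 regressors
    d = R @ b - r
    return float(d @ np.linalg.solve(s2 * R @ XtX_inv @ R.T, d))

rng = np.random.default_rng(4)
T, alpha, delta, phi = 400, 1.0, 0.2, 0.5
y_full = np.zeros(T + 1)                         # y_full[0] is the presample value y_0
for t in range(1, T + 1):
    y_full[t] = alpha + delta * t + phi * y_full[t - 1] + rng.standard_t(df=5)

X = np.column_stack([y_full[:-1], np.ones(T), np.arange(1, T + 1)])  # (y_{t-1}, 1, t)
y = y_full[1:]

R = np.array([[1.0, 0.0, 0.0],                   # restriction on phi_1
              [0.0, 0.0, 1.0]])                  # restriction on delta
r = np.array([phi, delta])                       # true values, so the null holds
H = np.array([[1.0, 0.0], [-1.0, 1.0]])          # nonsingular (m x m) matrix

chi2_original = wald_stat(X, y, R, r)
chi2_rewritten = wald_stat(X, y, H @ R, H @ r)
print(chi2_original, chi2_rewritten)
```

The two printed values should agree up to rounding, and under the null the statistic would be compared with a $\chi^2(2)$ critical value.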


APPENDIX 16.A. Derivation of Selected Equations for Chapter 16 


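■ Derivation of [16.3.13]. As in [16.1.6],

$$b_T^* - \beta^* = \left[ \sum_{t=1}^{T} x_t^* [x_t^*]' \right]^{-1} \left[ \sum_{t=1}^{T} x_t^* \varepsilon_t \right],$$  [16.A.1]

since the population residuals $\varepsilon_t$ are identical for the transformed and untransformed representations. As in [16.1.18], premultiply by $\Upsilon_T$ to write

$$\Upsilon_T (b_T^* - \beta^*) = \left\{ \Upsilon_T^{-1} \sum_{t=1}^{T} x_t^* [x_t^*]' \Upsilon_T^{-1} \right\}^{-1} \left\{ \Upsilon_T^{-1} \sum_{t=1}^{T} x_t^* \varepsilon_t \right\}.$$  [16.A.2]

From [16.3.9],

$$\sum_{t=1}^{T} x_t^* [x_t^*]' = \begin{bmatrix}
\sum (y_{t-1}^*)^2 & \sum y_{t-1}^* y_{t-2}^* & \cdots & \sum y_{t-1}^* y_{t-p}^* & \sum y_{t-1}^* & \sum t y_{t-1}^* \\
\sum y_{t-2}^* y_{t-1}^* & \sum (y_{t-2}^*)^2 & \cdots & \sum y_{t-2}^* y_{t-p}^* & \sum y_{t-2}^* & \sum t y_{t-2}^* \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\
\sum y_{t-p}^* y_{t-1}^* & \sum y_{t-p}^* y_{t-2}^* & \cdots & \sum (y_{t-p}^*)^2 & \sum y_{t-p}^* & \sum t y_{t-p}^* \\
\sum y_{t-1}^* & \sum y_{t-2}^* & \cdots & \sum y_{t-p}^* & \sum 1 & \sum t \\
\sum t y_{t-1}^* & \sum t y_{t-2}^* & \cdots & \sum t y_{t-p}^* & \sum t & \sum t^2
\end{bmatrix}$$

and

$$\Upsilon_T^{-1} \sum_{t=1}^{T} x_t^* [x_t^*]' \Upsilon_T^{-1} = \begin{bmatrix}
T^{-1} \sum (y_{t-1}^*)^2 & T^{-1} \sum y_{t-1}^* y_{t-2}^* & \cdots & T^{-1} \sum y_{t-1}^* y_{t-p}^* & T^{-1} \sum y_{t-1}^* & T^{-2} \sum t y_{t-1}^* \\
T^{-1} \sum y_{t-2}^* y_{t-1}^* & T^{-1} \sum (y_{t-2}^*)^2 & \cdots & T^{-1} \sum y_{t-2}^* y_{t-p}^* & T^{-1} \sum y_{t-2}^* & T^{-2} \sum t y_{t-2}^* \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\
T^{-1} \sum y_{t-p}^* y_{t-1}^* & T^{-1} \sum y_{t-p}^* y_{t-2}^* & \cdots & T^{-1} \sum (y_{t-p}^*)^2 & T^{-1} \sum y_{t-p}^* & T^{-2} \sum t y_{t-p}^* \\
T^{-1} \sum y_{t-1}^* & T^{-1} \sum y_{t-2}^* & \cdots & T^{-1} \sum y_{t-p}^* & T^{-1} \cdot T & T^{-2} \sum t \\
T^{-2} \sum t y_{t-1}^* & T^{-2} \sum t y_{t-2}^* & \cdots & T^{-2} \sum t y_{t-p}^* & T^{-2} \sum t & T^{-3} \sum t^2
\end{bmatrix}.$$  [16.A.3]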



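For the first $p$ rows and columns, the row $i$, column $j$ element of this matrix is

$$T^{-1} \sum_{t=1}^{T} y_{t-i}^* y_{t-j}^*.$$

But $y_t^*$ follows a zero-mean stationary AR($p$) process satisfying the conditions of Exercise 7.7. Thus, these terms converge in probability to $\gamma_{|i-j|}^*$. The first $p$ elements of row $p + 1$ (or the first $p$ elements of column $p + 1$) are of the form

$$T^{-1} \sum_{t=1}^{T} y_{t-j}^*,$$

which converge in probability to zero. The first $p$ elements of row $p + 2$ (or the first $p$ elements of column $p + 2$) are of the form

$$T^{-1} \sum_{t=1}^{T} (t/T) y_{t-j}^*,$$

which can be shown to converge in probability to zero with a ready adaptation of the techniques in Chapter 7 (see Exercise 16.3). Finally, the $(2 \times 2)$ matrix in the bottom right corner of [16.A.3] converges to

$$\begin{bmatrix} 1 & \tfrac{1}{2} \\ \tfrac{1}{2} & \tfrac{1}{3} \end{bmatrix}.$$

Thus

$$\Upsilon_T^{-1} \sum_{t=1}^{T} x_t^* [x_t^*]' \Upsilon_T^{-1} \xrightarrow{p} Q^*$$  [16.A.4]

for $Q^*$ the matrix in [16.3.15].

Turning next to the second term in [16.A.2],

$$\Upsilon_T^{-1} \sum_{t=1}^{T} x_t^* \varepsilon_t = \begin{bmatrix}
T^{-1/2} \sum y_{t-1}^* \varepsilon_t \\
T^{-1/2} \sum y_{t-2}^* \varepsilon_t \\
\vdots \\
T^{-1/2} \sum y_{t-p}^* \varepsilon_t \\
T^{-1/2} \sum \varepsilon_t \\
T^{-1/2} \sum (t/T) \varepsilon_t
\end{bmatrix} = T^{-1/2} \sum_{t=1}^{T} \xi_t,$$  [16.A.5]

where

$$\xi_t \equiv \begin{bmatrix} y_{t-1}^* \varepsilon_t \\ y_{t-2}^* \varepsilon_t \\ \vdots \\ y_{t-p}^* \varepsilon_t \\ \varepsilon_t \\ (t/T) \varepsilon_t \end{bmatrix}.$$

But $\xi_t$ is a martingale difference sequence with variance

$$E(\xi_t \xi_t') = \sigma^2 Q_t^*,$$

where

$$Q_t^* = \begin{bmatrix}
\gamma_0^* & \gamma_1^* & \gamma_2^* & \cdots & \gamma_{p-1}^* & 0 & 0 \\
\gamma_1^* & \gamma_0^* & \gamma_1^* & \cdots & \gamma_{p-2}^* & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\
\gamma_{p-1}^* & \gamma_{p-2}^* & \gamma_{p-3}^* & \cdots & \gamma_0^* & 0 & 0 \\
0 & 0 & 0 & \cdots & 0 & 1 & t/T \\
0 & 0 & 0 & \cdots & 0 & t/T & t^2/T^2
\end{bmatrix}$$

and

$$(1/T) \sum_{t=1}^{T} Q_t^* \to Q^*.$$

Applying the arguments used in Exercise 8.3 and in [16.1.24], it can be shown that

$$\Upsilon_T^{-1} \sum_{t=1}^{T} x_t^* \varepsilon_t \xrightarrow{L} N(0, \sigma^2 Q^*).$$  [16.A.6]

It follows from [16.A.4], [16.A.6], and [16.A.2] that

$$\Upsilon_T (b_T^* - \beta^*) \xrightarrow{L} N(0, [Q^*]^{-1} \sigma^2 Q^* [Q^*]^{-1}) = N(0, \sigma^2 [Q^*]^{-1}),$$

as claimed in [16.3.13]. ■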


Chapter 16 Exercises 


16.1. Verify result [16.1.23].
16.2. Verify expression [16.1.27].
16.3. Let $y_t$ be covariance-stationary with mean zero and absolutely summable autocovariances:

$$\sum_{j=-\infty}^{\infty} |\gamma_j| < \infty$$

for $\gamma_j = E(y_t y_{t-j})$. Adapting the argument in expression [7.2.6], show that

$$T^{-1} \sum_{t=1}^{T} (t/T) y_t \xrightarrow{m.s.} 0.$$


Chapter 16 References 


Fuller, Wayne A. 1976. Introduction to Statistical Time Series. New York: Wiley.

Sims, Christopher A., James H. Stock, and Mark W. Watson. 1990. "Inference in Linear Time Series Models with Some Unit Roots." Econometrica 58:113-44.




17 


Univariate Processes 
with Unit Roots 


This chapter discusses statistical inference for univariate processes containing a 
unit root. Section 17.1 gives a brief explanation of why the asymptotic distributions 
and rates of convergence for the estimated coefficients of unit root processes differ 
from those for stationary processes. The asymptotic distributions for unit root 
processes can be described in terms of functionals on Brownian motion. The basic 
idea behind Brownian motion is introduced in Section 17.2. The technical tools 
used to establish that the asymptotic distributions of certain statistics involving unit 
root processes can be represented in terms of such functionals are developed in 
Section 17.3, though it is not necessary to master these tools in order to read 
Sections 17.4 through 17.9. Section 17.4 derives the asymptotic distribution of the 
estimated coefficient for a first-order autoregression when the true process is a 
random walk. This distribution turns out to depend on whether a constant or time 
trend is included in the estimated regression and whether the true random walk is 
characterized by nonzero drift. 

Section 17.5 extends the results of Section 17.3 to cover unit root processes 
whose differences exhibit general serial correlation. These results can be used to 
develop two different classes of tests for unit roots. One approach, due to Phillips 
and Perron (1988), adjusts the statistics calculated from a simple first-order au- 
toregression to account for serial correlation of the differenced data. The second 
approach, due to Dickey and Fuller (1979), adds lags to the autoregression. These 
approaches are reviewed in Sections 17.6 and 17.7, respectively. Section 17.7 further 
derives the properties of all of the estimated coefficients for a pth-order auto- 
regression when one of the roots is unity. 

Readers interested solely in how these results are applied in practice may 
want to begin with the summaries in Table 17.2 or Table 17.3 and with the empirical 
applications described in Examples 17.6 through 17.9. 


17.1. Introduction 


Consider OLS estimation of a Gaussian AR(1) process,

$$y_t = \rho y_{t-1} + u_t,$$  [17.1.1]

where $u_t \sim$ i.i.d. $N(0, \sigma^2)$, and $y_0 = 0$. The OLS estimate of $\rho$ is given by

$$\hat{\rho}_T = \frac{\sum_{t=1}^{T} y_{t-1} y_t}{\sum_{t=1}^{T} y_{t-1}^2}.$$  [17.1.2]


We saw in Chapter 8 that if the true value of $\rho$ is less than 1 in absolute value, then

$$\sqrt{T}(\hat{\rho}_T - \rho) \xrightarrow{L} N(0, (1 - \rho^2)).$$  [17.1.3]

If [17.1.3] were also valid for the case when $\rho = 1$, it would seem to claim that $\sqrt{T}(\hat{\rho}_T - \rho)$ has zero variance, or that the distribution collapses to a point mass at zero:

$$\sqrt{T}(\hat{\rho}_T - 1) \xrightarrow{p} 0.$$  [17.1.4]

As we shall see shortly, [17.1.4] is indeed a valid statement for unit root processes, but it obviously is not very helpful for hypothesis tests. To obtain a nondegenerate asymptotic distribution for $\hat{\rho}_T$ in the unit root case, it turns out that we have to multiply $\hat{\rho}_T$ by $T$ rather than by $\sqrt{T}$. Thus, the unit root coefficient converges at a faster rate ($T$) than a coefficient for a stationary regression (which converges at $\sqrt{T}$), but at a slower rate than the coefficient on a time trend in the regressions analyzed in the previous chapter (which converged at $T^{3/2}$).
To get a better sense of why scaling by $T$ is necessary when the true value of $\rho$ is unity, recall that the difference between the estimate $\hat{\rho}_T$ and the true value can be expressed as in equation [8.2.3]:¹

$$(\hat{\rho}_T - 1) = \frac{\sum_{t=1}^{T} y_{t-1} u_t}{\sum_{t=1}^{T} y_{t-1}^2},$$  [17.1.5]

so that

$$T(\hat{\rho}_T - 1) = \frac{(1/T) \sum_{t=1}^{T} y_{t-1} u_t}{(1/T^2) \sum_{t=1}^{T} y_{t-1}^2}.$$  [17.1.6]

Consider first the numerator in [17.1.6]. When the true value of $\rho$ is unity, equation [17.1.1] describes a random walk with

$$y_t = u_t + u_{t-1} + \cdots + u_1,$$  [17.1.7]

since $y_0 = 0$. It follows from [17.1.7] that

$$y_t \sim N(0, \sigma^2 t).$$  [17.1.8]

Note further that for a random walk,

$$y_t^2 = (y_{t-1} + u_t)^2 = y_{t-1}^2 + 2 y_{t-1} u_t + u_t^2,$$

implying that

$$y_{t-1} u_t = (1/2)\{y_t^2 - y_{t-1}^2 - u_t^2\}.$$  [17.1.9]

If [17.1.9] is summed over $t = 1, 2, \ldots, T$, the result is

$$\sum_{t=1}^{T} y_{t-1} u_t = (1/2)\{y_T^2 - y_0^2\} - (1/2) \sum_{t=1}^{T} u_t^2.$$  [17.1.10]

Recalling that $y_0 = 0$, equation [17.1.10] establishes that

$$(1/T) \sum_{t=1}^{T} y_{t-1} u_t = (1/2) \cdot (1/T) y_T^2 - (1/2) \cdot (1/T) \sum_{t=1}^{T} u_t^2,$$  [17.1.11]

¹This discussion is based on Fuller (1976, p. 369).




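and if each side of [17.1.11] is divided by $\sigma^2$, the result is

$$\left( \frac{1}{\sigma^2 T} \right) \sum_{t=1}^{T} y_{t-1} u_t = \left( \frac{1}{2} \right) \left( \frac{y_T}{\sigma \sqrt{T}} \right)^2 - \left( \frac{1}{2\sigma^2} \right) \left( \frac{1}{T} \right) \sum_{t=1}^{T} u_t^2.$$  [17.1.12]

But [17.1.8] implies that $y_T/(\sigma\sqrt{T})$ is a $N(0, 1)$ variable, so that its square is $\chi^2(1)$:

$$[y_T/(\sigma\sqrt{T})]^2 \sim \chi^2(1).$$  [17.1.13]

Also, $\sum_{t=1}^{T} u_t^2$ is the sum of $T$ i.i.d. random variables, each with mean $\sigma^2$, and so, by the law of large numbers,

$$(1/T) \cdot \sum_{t=1}^{T} u_t^2 \xrightarrow{p} \sigma^2.$$  [17.1.14]

Using [17.1.13] and [17.1.14], it follows from [17.1.12] that

$$[1/(\sigma^2 T)] \sum_{t=1}^{T} y_{t-1} u_t \xrightarrow{L} (1/2) \cdot (X - 1),$$  [17.1.15]

where $X \sim \chi^2(1)$.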
Turning next to the denominator of [17.1.6], consider

$$\sum_{t=1}^{T} y_{t-1}^2.$$  [17.1.16]

Recall from [17.1.8] that $y_{t-1} \sim N(0, \sigma^2(t - 1))$, so $E(y_{t-1}^2) = \sigma^2(t - 1)$. Consider the mean of [17.1.16],

$$E\left[ \sum_{t=1}^{T} y_{t-1}^2 \right] = \sigma^2 \sum_{t=1}^{T} (t - 1) = \sigma^2 (T - 1)T/2.$$

In order to construct a random variable that could have a convergent distribution, the quantity in [17.1.16] will have to be divided by $T^2$, as was done in the denominator of [17.1.6].

To summarize, if the true process is a random walk, then the deviation of the OLS estimate from the true value $(\hat{\rho}_T - 1)$ must be multiplied by $T$ rather than $\sqrt{T}$ to obtain a variable with a useful asymptotic distribution. Moreover, this asymptotic distribution is not the usual Gaussian distribution but instead is a ratio involving a $\chi^2(1)$ variable in the numerator and a separate, nonstandard distribution in the denominator.
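The pieces of this argument can be examined in a small simulation. The sketch below is illustrative only; the sample size, the number of replications, and the variable names are assumptions. It generates Gaussian random walks with $y_0 = 0$, forms the numerator and denominator of [17.1.6], verifies the identity [17.1.11] replication by replication, and reports how often the numerator is negative. By [17.1.15] that share should approach $P\{X < 1\}$ for $X \sim \chi^2(1)$, which is roughly 0.68.

```python
import numpy as np

rng = np.random.default_rng(5)
T, reps, sigma = 500, 4000, 1.0

u = rng.normal(0.0, sigma, size=(reps, T))
y = np.cumsum(u, axis=1)                             # y_t = u_1 + ... + u_t, with y_0 = 0
y_lag = np.hstack([np.zeros((reps, 1)), y[:, :-1]])  # y_{t-1} for t = 1, ..., T

num = (y_lag * u).sum(axis=1) / T                    # (1/T) sum of y_{t-1} u_t
den = (y_lag ** 2).sum(axis=1) / T ** 2              # (1/T^2) sum of y_{t-1}^2
stat = num / den                                     # T (rho_hat - 1), as in [17.1.6]

# Right side of [17.1.11], computed from y_T and the squared innovations
identity_rhs = 0.5 * y[:, -1] ** 2 / T - 0.5 * (u ** 2).sum(axis=1) / T
share_negative = np.mean(num < 0)

print("max identity gap:", np.max(np.abs(num - identity_rhs)))
print("share of negative numerators:", share_negative)
print("quantiles of T(rho_hat - 1):", np.quantile(stat, [0.05, 0.5, 0.95]))
```

Because the denominator is positive, the sign of $T(\hat{\rho}_T - 1)$ matches the sign of the numerator, so the simulated distribution would be expected to be skewed toward negative values rather than symmetric around zero.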

The asymptotic distribution of $T(\hat{\rho}_T - 1)$ will be fully characterized in Section 17.4. In preparation for this, the idea of Brownian motion is introduced in Section 17.2, followed by a discussion of the functional central limit theorem in Section 17.3.


17.2. Brownian Motion 


Consider a random walk,

$$y_t = y_{t-1} + \varepsilon_t,$$  [17.2.1]

in which the innovations are standard Normal variables:

$$\varepsilon_t \sim \text{i.i.d. } N(0, 1).$$

If the process is started with $y_0 = 0$, then it follows as in [17.1.7] and [17.1.8] that

$$y_t = \varepsilon_1 + \varepsilon_2 + \cdots + \varepsilon_t \sim N(0, t).$$



Moreover, the change in the value of $y$ between dates $t$ and $s$,

$$y_s - y_t = \varepsilon_{t+1} + \varepsilon_{t+2} + \cdots + \varepsilon_s,$$

is itself $N(0, (s - t))$ and is independent of the change between dates $r$ and $q$ for any dates $t < s < r < q$.

Consider the change between $y_{t-1}$ and $y_t$. This innovation $\varepsilon_t$ was taken to be $N(0, 1)$. Suppose we view $\varepsilon_t$ as the sum of two independent Gaussian variables:

$$\varepsilon_t = e_{1t} + e_{2t},$$

with $e_{it} \sim$ i.i.d. $N(0, \tfrac{1}{2})$. We might then associate $e_{1t}$ with the change between $y_{t-1}$ and the value of $y$ at some interim point (say, $y_{t-(1/2)}$),

$$y_{t-(1/2)} - y_{t-1} = e_{1t},$$  [17.2.2]

and $e_{2t}$ with the change between $y_{t-(1/2)}$ and $y_t$:

$$y_t - y_{t-(1/2)} = e_{2t}.$$  [17.2.3]

Sampled at integer dates $t = 1, 2, \ldots$, the process of [17.2.2] and [17.2.3] will have exactly the same properties as [17.2.1], since

$$y_t - y_{t-1} = e_{1t} + e_{2t} \sim \text{i.i.d. } N(0, 1).$$

In addition, the process of [17.2.2] and [17.2.3] is defined also at the noninteger dates $\{t + \tfrac{1}{2}\}_{t=0}^{\infty}$ and retains the property for both integer and noninteger dates that $y_s - y_t \sim N(0, s - t)$ with $y_s - y_t$ independent of the change over any other nonoverlapping interval.

By the same reasoning, we could imagine partitioning the change between
$t - 1$ and $t$ into $N$ separate subperiods:

$$y_t - y_{t-1} = e_{1t} + e_{2t} + \cdots + e_{Nt},$$

with $e_{it} \sim$ i.i.d. $N(0, 1/N)$. The result would be a process with all the same properties
as [17.2.1], defined at a finer and finer grid of dates as we increase $N$. The limit
as $N \to \infty$ is a continuous-time process known as standard Brownian motion. The
value of this process at date $t$ is denoted $W(t)$.² A continuous-time process is a
random variable that takes on a value for any nonnegative real number $t$, as distinct
from a discrete-time process, which is only defined at integer values of $t$. To
emphasize the distinction, we will put the date in parentheses when describing the
value of a continuous-time variable at date $t$ (as in $W(t)$) and use subscripts for a
discrete-time variable (as in $y_t$). A discrete-time process was represented as a
countable sequence of random variables, denoted $\{y_t\}_{t=1}^{\infty}$. A realization of a
continuous-time process can be viewed as a stochastic function, denoted $W(\cdot)$, where
$W: t \in [0, \infty) \to \mathbb{R}^1$.

A particular realization of Brownian motion turns out to be a continuous
function of $t$. To see why it would be continuous, recall that the change between
$t$ and $t + \Delta$ is distributed $N(0, \Delta)$. Such a change is essentially certain to be
arbitrarily small as the interval $\Delta$ goes to zero.
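The partitioning argument can be illustrated numerically. The following is a minimal sketch, not part of the original exposition: the function names, the seed, and the choices N = 8 and 4000 replications are assumptions made purely for illustration. It builds the process on a grid of step 1/N from i.i.d. N(0, 1/N) pieces and checks that the change over a full unit of time has variance close to 1 while the change over half a unit has variance close to 1/2.

```python
import math
import random

def refined_walk(n_sub, rng):
    """Simulate y on the grid 0, 1/N, 2/N, ..., 1, starting from y = 0.

    Each of the N subperiod changes is drawn i.i.d. N(0, 1/N), so the
    change over the whole unit interval is still N(0, 1).
    """
    step_sd = math.sqrt(1.0 / n_sub)
    path = [0.0]
    for _ in range(n_sub):
        path.append(path[-1] + rng.gauss(0.0, step_sd))
    return path

def mean_square(values):
    """Second moment about zero (the increments have mean zero by construction)."""
    return sum(v * v for v in values) / len(values)

rng = random.Random(12345)   # fixed seed, chosen arbitrarily
n_sub = 8                    # N subperiods per unit of time (illustrative)
paths = [refined_walk(n_sub, rng) for _ in range(4000)]

var_full = mean_square([p[-1] - p[0] for p in paths])           # change over [0, 1]
var_half = mean_square([p[n_sub // 2] - p[0] for p in paths])   # change over [0, 1/2]
print(var_full, var_half)    # should be near 1 and 1/2 respectively
```

Increasing `n_sub` refines the grid without changing these variances, which is the sense in which the limit as N grows inherits the properties of [17.2.1].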


Definition: Standard Brownian motion $W(\cdot)$ is a continuous-time stochastic process,
associating each date $t \in [0, 1]$ with the scalar $W(t)$ such that:


(a) $W(0) = 0$;


²Brownian motion is sometimes also referred to as a Wiener process.




(b) For any dates $0 \le t_1 < t_2 < \cdots < t_k \le 1$, the changes $[W(t_2) - W(t_1)]$,
$[W(t_3) - W(t_2)], \ldots, [W(t_k) - W(t_{k-1})]$ are independent multivariate
Gaussian with $[W(s) - W(t)] \sim N(0, s - t)$;

(c) For any given realization, $W(t)$ is continuous in $t$ with probability 1.


There are advantages to restricting the analysis to dates $t$ within a closed
interval. All of the results in this text relate to the behavior of Brownian motion
for dates within the unit interval ($t \in [0, 1]$), and in anticipation of this we have
simply defined $W(\cdot)$ to be a function mapping $t \in [0, 1]$ into $\mathbb{R}^1$.

Other continuous-time processes can be generated from standard Brownian
motion. For example, the process

$$Z(t) = \sigma \cdot W(t)$$

has independent increments and is distributed $N(0, \sigma^2 t)$ across realizations. Such
a process is described as Brownian motion with variance $\sigma^2$. Thus, standard
Brownian motion could also be described as Brownian motion with unit variance.

As another example,

$$Z(t) = [W(t)]^2 \qquad [17.2.4]$$

would be distributed as $t$ times a $\chi^2(1)$ variable across realizations.

Although $W(t)$ is continuous in $t$, it cannot be differentiated using standard
calculus; the direction of change at $t$ is likely to be completely different from that
at $t + \Delta$, no matter how small we make $\Delta$.³


17.3. The Functional Central Limit Theorem 


One of the uses of Brownian motion is to permit more general statements of the
central limit theorem than those in Chapter 7. Recall the simplest version of the
central limit theorem: if $u_t \sim$ i.i.d. with mean zero and variance $\sigma^2$, then the sample
mean $\bar{u}_T \equiv (1/T)\sum_{t=1}^{T} u_t$ satisfies

$$\sqrt{T}\,\bar{u}_T \xrightarrow{L} N(0, \sigma^2).$$

Consider now an estimator based on the following principle: When given a
sample of size $T$, we calculate the mean of the first half of the sample and throw
out the rest of the observations:

$$\bar{u}_{[T/2]^*} = (1/[T/2]^*) \sum_{t=1}^{[T/2]^*} u_t.$$

Here $[T/2]^*$ denotes the largest integer that is less than or equal to $T/2$; that is,
$[T/2]^* = T/2$ for $T$ even and $[T/2]^* = (T - 1)/2$ for $T$ odd. This strange estimator
would also satisfy the central limit theorem:

$$\sqrt{[T/2]^*}\;\bar{u}_{[T/2]^*} \xrightarrow{L} N(0, \sigma^2). \qquad [17.3.1]$$

Moreover, this estimator would be independent of an estimator that uses only the
second half of the sample.
³For an introduction to differentiation and integration of Brownian motion, see Malliaris and Brock
(1982, Chapter 2).

More generally, we can construct a variable $X_T(r)$ from the sample mean of
the first $r$th fraction of observations, $r \in [0, 1]$, defined by

$$X_T(r) \equiv (1/T) \sum_{t=1}^{[Tr]^*} u_t. \qquad [17.3.2]$$

For any given realization, $X_T(r)$ is a step function in $r$, with

$$X_T(r) = \begin{cases}
0 & \text{for } 0 \le r < 1/T \\
u_1/T & \text{for } 1/T \le r < 2/T \\
(u_1 + u_2)/T & \text{for } 2/T \le r < 3/T \\
\quad\vdots \\
(u_1 + u_2 + \cdots + u_T)/T & \text{for } r = 1.
\end{cases} \qquad [17.3.3]$$
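Then

$$\sqrt{T}\cdot X_T(r) = (1/\sqrt{T}) \sum_{t=1}^{[Tr]^*} u_t = \left(\sqrt{[Tr]^*}/\sqrt{T}\right)\left(1/\sqrt{[Tr]^*}\right) \sum_{t=1}^{[Tr]^*} u_t. \qquad [17.3.4]$$

But

$$\left(1/\sqrt{[Tr]^*}\right) \sum_{t=1}^{[Tr]^*} u_t \xrightarrow{L} N(0, \sigma^2),$$

by the central limit theorem as in [17.3.1], while $(\sqrt{[Tr]^*}/\sqrt{T}) \to \sqrt{r}$. Hence, the
asymptotic distribution of $\sqrt{T}\cdot X_T(r)$ in [17.3.4] is that of $\sqrt{r}$ times a $N(0, \sigma^2)$
random variable, or

$$\sqrt{T}\cdot X_T(r) \xrightarrow{L} N(0, r\sigma^2)$$

and

$$\sqrt{T}\cdot [X_T(r)/\sigma] \xrightarrow{L} N(0, r). \qquad [17.3.5]$$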


If we were similarly to consider the behavior of a sample mean based on
observations $[Tr_1]^*$ through $[Tr_2]^*$ for $r_2 > r_1$, we would conclude that this too is
asymptotically Normal,

$$\sqrt{T}\cdot [X_T(r_2) - X_T(r_1)]/\sigma \xrightarrow{L} N(0, r_2 - r_1),$$

and is independent of the estimator in [17.3.5], provided that $r < r_1$. It thus should
not be surprising that the sequence of stochastic functions $\{\sqrt{T}\cdot X_T(\cdot)/\sigma\}_{T=1}^{\infty}$
has an asymptotic probability law that is described by standard Brownian motion
$W(\cdot)$:

$$\sqrt{T}\cdot X_T(\cdot)/\sigma \xrightarrow{L} W(\cdot). \qquad [17.3.6]$$

Note the difference between the claims in [17.3.5] and [17.3.6]. The expression
$X_T(\cdot)$ denotes a random function while $X_T(r)$ denotes the value that function
assumes at date $r$; thus, $X_T(\cdot)$ is a function, while $X_T(r)$ is a random variable.

Result [17.3.6] is known as the functional central limit theorem. The derivation
here assumed that $u_t$ was i.i.d. A more general statement will be provided in Section
17.5.
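A small sketch may help fix ideas about the step function in [17.3.2] and the pointwise limit in [17.3.5]. The code below is illustrative only: the helper name `X_T`, the seed, and the values T = 200, sigma = 2, r = 0.25 are assumptions, and r is chosen so that T times r is an exact integer, which sidesteps floating-point rounding inside the floor operation.

```python
import math
import random

def X_T(u, r):
    """Step function [17.3.2]: (1/T) times the sum of the first [Tr]* values of u."""
    T = len(u)
    k = math.floor(T * r)    # [Tr]*, the largest integer <= T*r
    return sum(u[:k]) / T

# Deterministic check against the pattern in [17.3.3] with T = 4.
u_small = [1.0, 2.0, 3.0, 4.0]
steps = [X_T(u_small, r) for r in (0.0, 0.25, 0.5, 0.75, 1.0)]
print(steps)   # [0.0, 0.25, 0.75, 1.5, 2.5]

# Monte Carlo check that sqrt(T) * X_T(r) / sigma has variance close to r.
rng = random.Random(2024)    # arbitrary seed
T, sigma, r = 200, 2.0, 0.25
draws = []
for _ in range(2000):
    u = [rng.gauss(0.0, sigma) for _ in range(T)]
    draws.append(math.sqrt(T) * X_T(u, r) / sigma)
var_hat = sum(d * d for d in draws) / len(draws)
print(var_hat)   # should be near r = 0.25
```

The simulation only examines one value of r at a time, which corresponds to [17.3.5]; the functional statement [17.3.6] concerns the joint behavior over all r and is not something a single-date check can establish.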

Evaluated at $r = 1$, the function $X_T(r)$ in [17.3.2] is just the sample mean:

$$X_T(1) = (1/T) \sum_{t=1}^{T} u_t.$$

Thus, when the functions in [17.3.6] are evaluated at $r = 1$, the conventional
central limit theorem [7.1.6] obtains as a special case of [17.3.6]:

$$\sqrt{T}\,X_T(1)/\sigma = [1/(\sigma\sqrt{T})] \sum_{t=1}^{T} u_t \xrightarrow{L} W(1) \sim N(0, 1). \qquad [17.3.7]$$




We earlier defined convergence in law for random variables, and now we
need to extend the definition to cover random functions. Let $S(\cdot)$ represent a
continuous-time stochastic process with $S(r)$ representing its value at some date $r$
for $r \in [0, 1]$. Suppose, further, that for any given realization, $S(\cdot)$ is a continuous
function of $r$ with probability 1. For $\{S_T(\cdot)\}_{T=1}^{\infty}$ a sequence of such continuous
functions, we say that $S_T(\cdot) \xrightarrow{L} S(\cdot)$ if all of the following hold:⁴


(a) For any finite collection of $k$ particular dates,

$$0 \le r_1 < r_2 < \cdots < r_k \le 1,$$

the sequence of $k$-dimensional random vectors $\{\mathbf{y}_T\}_{T=1}^{\infty}$ converges in
distribution to the vector $\mathbf{y}$, where

$$\mathbf{y}_T \equiv \begin{bmatrix} S_T(r_1) \\ S_T(r_2) \\ \vdots \\ S_T(r_k) \end{bmatrix}, \qquad \mathbf{y} \equiv \begin{bmatrix} S(r_1) \\ S(r_2) \\ \vdots \\ S(r_k) \end{bmatrix};$$

(b) For each $\varepsilon > 0$, the probability that $S_T(r_1)$ differs from $S_T(r_2)$ by more than $\varepsilon$ for any
dates $r_1$ and $r_2$ within $\delta$ of each other goes to zero uniformly in $T$ as $\delta \to 0$;


(c) $P\{|S_T(0)| > \lambda\} \to 0$ uniformly in $T$ as $\lambda \to \infty$.


This definition applies to sequences of continuous functions, though the
function in [17.3.2] is a discontinuous step function. Fortunately, the discontinuities
occur at a countable set of points. Formally, $S_T(\cdot)$ can be replaced with a similar
continuous function, interpolating between the steps (as in Hall and Heyde, 1980).
Alternatively, the definition of convergence of random functions can be generalized
to allow for discontinuities of the type in [17.3.2] (as in Chapter 3 of Billingsley,
1968).

It will also be helpful to extend the earlier definition of convergence in
probability to sequences of random functions. Let $\{S_T(\cdot)\}_{T=1}^{\infty}$ and $\{V_T(\cdot)\}_{T=1}^{\infty}$ denote
sequences of random continuous functions with $S_T: r \in [0, 1] \to \mathbb{R}^1$ and
$V_T: r \in [0, 1] \to \mathbb{R}^1$. Let the scalar $Y_T$ represent the largest amount by which $S_T(r)$ differs
from $V_T(r)$ for any $r$:

$$Y_T \equiv \sup_{r \in [0,1]} |S_T(r) - V_T(r)|.$$

Thus, $\{Y_T\}_{T=1}^{\infty}$ is a sequence of random variables, and we could talk about its
probability limit using the standard definition given in [7.1.2]. If the sequence of
scalars $\{Y_T\}_{T=1}^{\infty}$ converges in probability to zero, then we say that the sequence of
functions $S_T(\cdot)$ converges in probability to $V_T(\cdot)$. That is, the expression

$$S_T(\cdot) \xrightarrow{p} V_T(\cdot)$$

is interpreted to mean that

$$\sup_{r \in [0,1]} |S_T(r) - V_T(r)| \xrightarrow{p} 0.$$
⁴The sequence of probability measures induced by $\{S_T(\cdot)\}_{T=1}^{\infty}$ weakly converges (in the sense of
Billingsley, 1968) to the probability measure induced by $S(\cdot)$ if and only if conditions (a) through (c)
hold; see Theorem A.2, p. 275, in Hall and Heyde (1980).

With this definition, result (a) of Proposition 7.3 can be generalized to apply
to sequences of functions. Specifically, if $\{S_T(\cdot)\}_{T=1}^{\infty}$ and $\{V_T(\cdot)\}_{T=1}^{\infty}$ are sequences
of continuous functions with $V_T(\cdot) \xrightarrow{p} S_T(\cdot)$ and $S_T(\cdot) \xrightarrow{L} S(\cdot)$ for $S(\cdot)$ a
continuous function, then $V_T(\cdot) \xrightarrow{L} S(\cdot)$; see, for example, Stinchcombe and White
(1993).


Example 17.1 

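Let $\{x_T\}_{T=1}^{\infty}$ be a sequence of random scalars with $x_T \xrightarrow{p} 0$, and let $\{S_T(\cdot)\}_{T=1}^{\infty}$
be a sequence of random continuous functions, $S_T: r \in [0, 1] \to \mathbb{R}^1$ with
$S_T(\cdot) \xrightarrow{L} S(\cdot)$. Then the sequence of functions $\{V_T(\cdot)\}_{T=1}^{\infty}$ defined by $V_T(r) \equiv
S_T(r) + x_T$ has the property that $V_T(\cdot) \xrightarrow{L} S(\cdot)$. To see this, note that
$V_T(r) - S_T(r) = x_T$ for all $r$, so that

$$\sup_{r \in [0,1]} |S_T(r) - V_T(r)| = |x_T|,$$

which converges in probability to zero. Hence, $V_T(\cdot) \xrightarrow{p} S_T(\cdot)$, and therefore
$V_T(\cdot) \xrightarrow{L} S(\cdot)$.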


Example 17.2 
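Let $\eta_t$ be a strictly stationary time series with finite fourth moment, and let
$S_T(r) = (1/\sqrt{T})\cdot \eta_{[Tr]^*}$. Then $S_T(\cdot) \xrightarrow{p} 0$. To see this, note that

$$\begin{aligned}
P\Big\{\sup_{r \in [0,1]} |S_T(r)| > \delta\Big\}
&= P\big\{[|(1/\sqrt{T})\cdot\eta_1| > \delta] \text{ or } [|(1/\sqrt{T})\cdot\eta_2| > \delta] \text{ or } \cdots \text{ or } [|(1/\sqrt{T})\cdot\eta_T| > \delta]\big\} \\
&\le T\cdot P\{|(1/\sqrt{T})\cdot\eta_t| > \delta\} \\
&\le T\cdot \frac{E\{(1/\sqrt{T})\cdot\eta_t\}^4}{\delta^4} \\
&= \frac{E(\eta_t^4)}{T\delta^4},
\end{aligned}$$

where the next-to-last line follows from Chebyshev's inequality. Since $E(\eta_t^4)$
is finite, this probability goes to zero as $T \to \infty$, establishing that $S_T(\cdot) \xrightarrow{p} 0$,
as claimed.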


Continuous Mapping Theorem 


In Chapter 7 we saw that if $\{x_T\}_{T=1}^{\infty}$ is a sequence of random variables with
$x_T \xrightarrow{L} x$ and if $g: \mathbb{R}^1 \to \mathbb{R}^1$ is a continuous function, then $g(x_T) \xrightarrow{L} g(x)$. A similar
result holds for sequences of random functions. Here, the analog to the function
$g(\cdot)$ is a continuous functional, which could associate a real random variable $y$
with the stochastic function $S(\cdot)$. For example, $y = \int_0^1 S(r)\,dr$ and $y = \int_0^1 [S(r)]^2\,dr$
represent continuous functionals.⁵ The continuous mapping theorem⁶ states that if
$S_T(\cdot) \xrightarrow{L} S(\cdot)$ and $g(\cdot)$ is a continuous functional, then $g(S_T(\cdot)) \xrightarrow{L} g(S(\cdot))$.


⁵Continuity of a functional $g(\cdot)$ in this context means that for any $\varepsilon > 0$, there exists a $\delta > 0$ such
that if $h(r)$ and $k(r)$ are any continuous bounded functions on $[0, 1]$, $h: [0, 1] \to \mathbb{R}^1$ and $k: [0, 1] \to
\mathbb{R}^1$, such that $|h(r) - k(r)| < \delta$ for all $r \in [0, 1]$, then

$$|g[h(\cdot)] - g[k(\cdot)]| < \varepsilon.$$

⁶See, for example, Theorem A.3 on p. 276 in Hall and Heyde (1980).




The continuous mapping theorem also applies to a continuous functional $g(\cdot)$
that maps a continuous bounded function on $[0, 1]$ into another continuous bounded
function on $[0, 1]$. For example, the function whose value at $r$ is a positive constant
$\sigma$ times $h(r)$ represents the result of applying the continuous functional $g[h(\cdot)] =
\sigma\cdot h(\cdot)$ to $h(\cdot)$.⁷ Thus, it follows from [17.3.6] that

$$\sqrt{T}\cdot X_T(\cdot) \xrightarrow{L} \sigma\cdot W(\cdot). \qquad [17.3.8]$$

Recalling that $W(r) \sim N(0, r)$, result [17.3.8] implies that $\sqrt{T}\cdot X_T(r) \approx N(0, \sigma^2 r)$.

As another example, consider the function $S_T(\cdot)$ whose value at $r$ is given
by

$$S_T(r) \equiv [\sqrt{T}\cdot X_T(r)]^2. \qquad [17.3.9]$$

Since $\sqrt{T}\cdot X_T(\cdot) \xrightarrow{L} \sigma\cdot W(\cdot)$, it follows that

$$S_T(\cdot) \xrightarrow{L} \sigma^2 [W(\cdot)]^2. \qquad [17.3.10]$$

In other words, if the value $W(r)$ from a realization of standard Brownian motion
at every date $r$ is squared and then multiplied by $\sigma^2$, the resulting continuous-time
process would follow essentially the same probability law as does the
continuous-time process defined by $S_T(r)$ in [17.3.9] for $T$ sufficiently large.


Applications to Unit Root Processes 


The use of the functional central limit theorem to calculate the asymptotic
distribution of statistics constructed from unit root processes was pioneered by
Phillips (1986, 1987).⁸ The simplest illustration of Phillips's approach is provided
by a random walk,

$$y_t = y_{t-1} + u_t, \qquad [17.3.11]$$

where $\{u_t\}$ is an i.i.d. sequence with mean zero and variance $\sigma^2$. If $y_0 = 0$, then
[17.3.11] implies that

$$y_t = u_1 + u_2 + \cdots + u_t. \qquad [17.3.12]$$

Equation [17.3.12] can be used to express the stochastic function $X_T(r)$ defined in
[17.3.3] as

$$X_T(r) = \begin{cases}
0 & \text{for } 0 \le r < 1/T \\
y_1/T & \text{for } 1/T \le r < 2/T \\
y_2/T & \text{for } 2/T \le r < 3/T \\
\quad\vdots \\
y_T/T & \text{for } r = 1.
\end{cases} \qquad [17.3.13]$$


⁷Here continuity of the functional $g(\cdot)$ means that for any $\varepsilon > 0$, there exists a $\delta > 0$ such that if
$h(r)$ and $k(r)$ are any continuous bounded functions on $[0, 1]$, $h: [0, 1] \to \mathbb{R}^1$ and $k: [0, 1] \to \mathbb{R}^1$, such
that $|h(r) - k(r)| < \delta$ for all $r \in [0, 1]$, then

$$|g[h(r)] - g[k(r)]| < \varepsilon$$

for all $r \in [0, 1]$.

⁸Result [17.4.7] in the next section for the case with i.i.d. errors was first derived by White (1958).
Phillips (1986, 1987) developed the general derivation presented here based on the functional central
limit theorem and the continuous mapping theorem. Other important contributions include Dickey and
Fuller (1979), Chan and Wei (1988), Park and Phillips (1988, 1989), Sims, Stock, and Watson (1990),
and Phillips and Solo (1992).

FIGURE 17.1 Plot of $X_T(r)$ as a function of $r$.

Figure 17.1 plots $X_T(r)$ as a function of $r$. Note that the area under this step function
is the sum of $T$ rectangles. The $t$th rectangle has width $1/T$ and height $y_{t-1}/T$, and
therefore has area $y_{t-1}/T^2$. The integral of $X_T(r)$ is thus equivalent to


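$$\int_0^1 X_T(r)\,dr = y_1/T^2 + y_2/T^2 + \cdots + y_{T-1}/T^2. \qquad [17.3.14]$$

Multiplying both sides of [17.3.14] by $\sqrt{T}$ establishes that

$$\int_0^1 \sqrt{T}\cdot X_T(r)\,dr = T^{-3/2} \sum_{t=1}^{T} y_{t-1}. \qquad [17.3.15]$$

But we know from [17.3.8] and the continuous mapping theorem that as $T \to \infty$,

$$\int_0^1 \sqrt{T}\cdot X_T(r)\,dr \xrightarrow{L} \sigma\cdot\int_0^1 W(r)\,dr,$$

implying from [17.3.15] that

$$T^{-3/2} \sum_{t=1}^{T} y_{t-1} \xrightarrow{L} \sigma\cdot\int_0^1 W(r)\,dr. \qquad [17.3.16]$$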


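It is also instructive to derive [17.3.16] from first principles. From [17.3.12],
we can write

$$\begin{aligned}
T^{-3/2} \sum_{t=1}^{T} y_{t-1}
&= T^{-3/2}[u_1 + (u_1 + u_2) + (u_1 + u_2 + u_3) + \cdots + (u_1 + u_2 + u_3 + \cdots + u_{T-1})] \\
&= T^{-3/2}[(T - 1)u_1 + (T - 2)u_2 + (T - 3)u_3 + \cdots + [T - (T - 1)]u_{T-1}] \\
&= T^{-3/2} \sum_{t=1}^{T} (T - t)u_t \\
&= T^{-1/2} \sum_{t=1}^{T} u_t - T^{-3/2} \sum_{t=1}^{T} t u_t.
\end{aligned} \qquad [17.3.17]$$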




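Recall from [16.1.24] that

$$\begin{bmatrix} T^{-1/2} \sum_{t=1}^{T} u_t \\ T^{-3/2} \sum_{t=1}^{T} t u_t \end{bmatrix}
\xrightarrow{L} N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix},\; \sigma^2 \begin{bmatrix} 1 & 1/2 \\ 1/2 & 1/3 \end{bmatrix} \right). \qquad [17.3.18]$$

Thus, [17.3.17] implies that $T^{-3/2} \sum_{t=1}^{T} y_{t-1}$ is asymptotically Gaussian with mean
zero and variance equal to

$$\sigma^2\{1 - 2\cdot(1/2) + 1/3\} = \sigma^2/3.$$

Evidently, $\sigma\cdot\int_0^1 W(r)\,dr$ in [17.3.16] describes a random variable that has a
$N(0, \sigma^2/3)$ distribution.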

Thus, if $y_t$ is a driftless random walk, the sample mean $T^{-1} \sum_{t=1}^{T} y_t$ diverges
but $T^{-3/2} \sum_{t=1}^{T} y_t$ converges to a Gaussian random variable whose distribution can
be described as the integral of the realization of Brownian motion with variance
$\sigma^2$.
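The variance of one-third can be checked by simulation. The sketch below is an illustration under assumed settings (T = 100, unit innovation variance, 2000 replications, an arbitrary seed; the function names are hypothetical). It also evaluates the exact finite-sample variance implied by the weights (T - t) in [17.3.17], which approaches one-third of the innovation variance as T grows.

```python
import random

def scaled_sum_of_lags(u):
    """T^(-3/2) * sum_{t=1}^{T} y_{t-1}, where y_t = u_1 + ... + u_t and y_0 = 0."""
    T = len(u)
    y_prev, total = 0.0, 0.0
    for u_t in u:
        total += y_prev      # add y_{t-1}
        y_prev += u_t        # update to y_t
    return total / T ** 1.5

def exact_variance(T, sigma2):
    """Finite-T variance of the statistic, using the weights (T - t) from [17.3.17]."""
    return sigma2 * sum((T - t) ** 2 for t in range(1, T + 1)) / T ** 3

rng = random.Random(17)      # arbitrary seed
T, reps = 100, 2000
draws = [scaled_sum_of_lags([rng.gauss(0.0, 1.0) for _ in range(T)])
         for _ in range(reps)]
var_hat = sum(d * d for d in draws) / reps

print(exact_variance(T, 1.0))   # 0.32835, already close to 1/3
print(var_hat)                  # simulated counterpart
```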

Expression [17.3.17] also gives us a way to describe the asymptotic distribution
of $T^{-3/2} \sum_{t=1}^{T} t u_t$ in terms of functionals on Brownian motion:

$$\begin{aligned}
T^{-3/2} \sum_{t=1}^{T} t u_t &= T^{-1/2} \sum_{t=1}^{T} u_t - T^{-3/2} \sum_{t=1}^{T} y_{t-1} \\
&\xrightarrow{L} \sigma\cdot W(1) - \sigma\cdot\int_0^1 W(r)\,dr,
\end{aligned} \qquad [17.3.19]$$

with the last line following from [17.3.7] and [17.3.16]. Recalling [17.3.18], the
random variable on the right side of [17.3.19] evidently has a $N(0, \sigma^2/3)$ distribution.
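As a worked check on [17.3.19], matching variances on its two sides pins down the covariance between $W(1)$ and $\int_0^1 W(r)\,dr$. The symbol $c$ is introduced here purely for illustration and does not appear in the original derivation; the steps use only $\operatorname{Var}[W(1)] = 1$ from [17.3.7], the variance $1/3$ found for $\int_0^1 W(r)\,dr$, and the $\sigma^2/3$ entry of [17.3.18].

$$\begin{aligned}
\operatorname{Var}\Big[\sigma W(1) - \sigma \int_0^1 W(r)\,dr\Big]
  &= \sigma^2\{1 - 2c + 1/3\}, \qquad c \equiv \operatorname{Cov}\Big[W(1), \int_0^1 W(r)\,dr\Big], \\
\sigma^2/3 &= \sigma^2\{1 - 2c + 1/3\} \;\Longrightarrow\; c = 1/2.
\end{aligned}$$

The same value follows directly from the independent-increments property, since $E[W(1)W(r)] = r$ for $r \in [0, 1]$ and therefore $c = \int_0^1 r\,dr = 1/2$.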

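A similar argument to that in [17.3.15] can be used to describe the asymptotic
distribution of the sum of squares of a random walk. The statistic $S_T(r)$ defined in
[17.3.9],

$$S_T(r) \equiv T\cdot[X_T(r)]^2, \qquad [17.3.20]$$

can be written using [17.3.13] as

$$S_T(r) = \begin{cases}
0 & \text{for } 0 \le r < 1/T \\
y_1^2/T & \text{for } 1/T \le r < 2/T \\
y_2^2/T & \text{for } 2/T \le r < 3/T \\
\quad\vdots \\
y_T^2/T & \text{for } r = 1.
\end{cases} \qquad [17.3.21]$$

It follows that

$$\int_0^1 S_T(r)\,dr = y_1^2/T^2 + y_2^2/T^2 + \cdots + y_{T-1}^2/T^2.$$

Thus, from [17.3.10] and the continuous mapping theorem,

$$T^{-2} \sum_{t=1}^{T} y_{t-1}^2 \xrightarrow{L} \sigma^2\cdot\int_0^1 [W(r)]^2\,dr. \qquad [17.3.22]$$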
tal 


Two other useful results are

$$T^{-5/2} \sum_{t=1}^{T} t y_{t-1} = T^{-3/2} \sum_{t=1}^{T} (t/T) y_{t-1} \xrightarrow{L} \sigma\cdot\int_0^1 r W(r)\,dr \qquad [17.3.23]$$

for $r = t/T$ and

$$T^{-3} \sum_{t=1}^{T} t y_{t-1}^2 = T^{-2} \sum_{t=1}^{T} (t/T) y_{t-1}^2 \xrightarrow{L} \sigma^2\cdot\int_0^1 r\cdot[W(r)]^2\,dr. \qquad [17.3.24]$$
As yet another useful application, consider the statistic in [17.1.11]:

$$T^{-1} \sum_{t=1}^{T} y_{t-1}u_t = (1/2)\cdot(1/T)y_T^2 - (1/2)\cdot(1/T) \sum_{t=1}^{T} u_t^2.$$

Recalling [17.3.21], this can be written

$$T^{-1} \sum_{t=1}^{T} y_{t-1}u_t = (1/2)S_T(1) - (1/2)(1/T) \sum_{t=1}^{T} u_t^2. \qquad [17.3.25]$$

But $(1/T) \sum_{t=1}^{T} u_t^2 \xrightarrow{p} \sigma^2$, by the law of large numbers, and $S_T(1) \xrightarrow{L} \sigma^2[W(1)]^2$,
by [17.3.10]. It thus follows from [17.3.25] that

$$T^{-1} \sum_{t=1}^{T} y_{t-1}u_t \xrightarrow{L} (1/2)\sigma^2[W(1)]^2 - (1/2)\sigma^2. \qquad [17.3.26]$$

Recall that $W(1)$, the value of standard Brownian motion at date $r = 1$, has a
$N(0, 1)$ distribution, meaning that $[W(1)]^2$ has a $\chi^2(1)$ distribution. Result [17.3.26]
is therefore just another way to express the earlier result [17.1.15] using a functional
on Brownian motion instead of the $\chi^2$ distribution.
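The algebraic identity behind [17.3.25] holds exactly in every sample when the walk starts at zero, which makes it easy to verify on a tiny example. The helper name and the three innovations below are made up for illustration.

```python
def both_sides(u):
    """Return (sum of y_{t-1}*u_t, (1/2)*y_T^2 - (1/2)*sum of u_t^2) with y_0 = 0."""
    y_prev, cross, sum_sq = 0.0, 0.0, 0.0
    for u_t in u:
        cross += y_prev * u_t    # y_{t-1} * u_t
        sum_sq += u_t * u_t      # u_t^2
        y_prev += u_t            # y_t = y_{t-1} + u_t
    return cross, 0.5 * y_prev ** 2 - 0.5 * sum_sq

lhs, rhs = both_sides([1.0, -2.0, 3.0])
print(lhs, rhs)   # -5.0 -5.0
```

Dividing both numbers by T reproduces the two sides of [17.3.25]; the only randomness left in the limit comes from the terminal value of the walk and the average squared innovation.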


17.4. Asymptotic Properties of a First-Order 
Autoregression when the True Coefficient Is Unity 


We are now in a position to calculate the asymptotic distribution of some simple 
regressions involving unit roots. For convenience, the results from Section 17.3 are 
collected in the form of a proposition. 


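Proposition 17.1: Suppose that $\xi_t$ follows a random walk without drift,

$$\xi_t = \xi_{t-1} + u_t,$$

where $\xi_0 = 0$ and $\{u_t\}$ is an i.i.d. sequence with mean zero and variance $\sigma^2$. Then

(a) $T^{-1/2} \sum_{t=1}^{T} u_t \xrightarrow{L} \sigma\cdot W(1)$   [17.3.7];

(b) $T^{-1} \sum_{t=1}^{T} \xi_{t-1}u_t \xrightarrow{L} (1/2)\sigma^2\{[W(1)]^2 - 1\}$   [17.3.26];

(c) $T^{-3/2} \sum_{t=1}^{T} t u_t \xrightarrow{L} \sigma\cdot W(1) - \sigma\cdot\int_0^1 W(r)\,dr$   [17.3.19];

(d) $T^{-3/2} \sum_{t=1}^{T} \xi_{t-1} \xrightarrow{L} \sigma\cdot\int_0^1 W(r)\,dr$   [17.3.16];

(e) $T^{-2} \sum_{t=1}^{T} \xi_{t-1}^2 \xrightarrow{L} \sigma^2\cdot\int_0^1 [W(r)]^2\,dr$   [17.3.22];

(f) $T^{-5/2} \sum_{t=1}^{T} t\,\xi_{t-1} \xrightarrow{L} \sigma\cdot\int_0^1 r W(r)\,dr$   [17.3.23];

(g) $T^{-3} \sum_{t=1}^{T} t\,\xi_{t-1}^2 \xrightarrow{L} \sigma^2\cdot\int_0^1 r\cdot[W(r)]^2\,dr$   [17.3.24];

(h) $T^{-(\nu+1)} \sum_{t=1}^{T} t^{\nu} \to 1/(\nu + 1)$ for $\nu = 0, 1, \ldots$   [16.1.15].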
tilt 




The expressions in brackets indicate where the stated result was earlier
derived. Though the earlier derivations assumed that the initial value $\xi_0$ was equal
to zero, the same results are obtained when $\xi_0$ is any fixed value or drawn from a
specified distribution as in Phillips (1987).

The asymptotic distributions in Proposition 17.1 are all written in terms of
functionals on standard Brownian motion, denoted $W(r)$. Note that this is the same
Brownian motion $W(r)$ in each result (a) through (g), so that in general the
magnitudes in Proposition 17.1 are all correlated. If we are not interested in capturing
these correlations, then there are simpler ways to describe the asymptotic
distributions. For example, we have seen that (a) is just a $N(0, \sigma^2)$ distribution, (b) is
$(1/2)\sigma^2\cdot[\chi^2(1) - 1]$, and (c) and (d) are $N(0, \sigma^2/3)$. Exercise 17.1 gives an example
of one approach to calculating the covariances among random variables described
by these functionals on Brownian motion.

Proposition 17.1 can be used to calculate the asymptotic distributions of 
statistics from a number of simple regressions involving unit roots. This section 
discusses several key cases. 


Case 1. No Constant Term or Time Trend in the Regression; 
True Process Is a Random Walk 


Consider first OLS estimation of $\rho$ based on an AR(1) regression,

$$y_t = \rho y_{t-1} + u_t, \qquad [17.4.1]$$

where $u_t$ is i.i.d. with mean zero and variance $\sigma^2$. We are interested in the properties
of the OLS estimate

$$\hat{\rho}_T = \frac{\sum_{t=1}^{T} y_{t-1}y_t}{\sum_{t=1}^{T} y_{t-1}^2} \qquad [17.4.2]$$

when the true value of $\rho$ is unity. From [17.1.6], the deviation of the OLS estimate
from the true value is characterized by

$$T(\hat{\rho}_T - 1) = \frac{T^{-1} \sum_{t=1}^{T} y_{t-1}u_t}{T^{-2} \sum_{t=1}^{T} y_{t-1}^2}. \qquad [17.4.3]$$

If the true value of $\rho$ is unity, then

$$y_t = y_0 + u_1 + u_2 + \cdots + u_t. \qquad [17.4.4]$$

Apart from the initial term $y_0$ (which does not affect any of the asymptotic
distributions), the variable $y_t$ is the same as the quantity labeled $\xi_t$ in Proposition 17.1.
From result (b) of that proposition,

$$T^{-1} \sum_{t=1}^{T} y_{t-1}u_t \xrightarrow{L} (1/2)\sigma^2\{[W(1)]^2 - 1\}, \qquad [17.4.5]$$

while from result (e),

$$T^{-2} \sum_{t=1}^{T} y_{t-1}^2 \xrightarrow{L} \sigma^2\cdot\int_0^1 [W(r)]^2\,dr. \qquad [17.4.6]$$

Since [17.4.3] is a continuous function of [17.4.5] and [17.4.6], it follows from
Proposition 7.3(c) that under the null hypothesis that $\rho = 1$, the OLS estimate
$\hat{\rho}_T$ is characterized by

$$T(\hat{\rho}_T - 1) \xrightarrow{L} \frac{(1/2)\{[W(1)]^2 - 1\}}{\int_0^1 [W(r)]^2\,dr}. \qquad [17.4.7]$$


Recall that $[W(1)]^2$ is a $\chi^2(1)$ variable. The probability that a $\chi^2(1)$ variable
is less than unity is 0.68, and since the denominator of [17.4.7] must be positive,
the probability that $\hat{\rho}_T - 1$ is negative approaches 0.68 as $T$ becomes large. In
other words, in two-thirds of the samples generated by a random walk, the estimate
$\hat{\rho}_T$ will be less than the true value of unity. Moreover, in those samples for which
$[W(1)]^2$ is large, the denominator of [17.4.7] will be large as well. The result is that
the limiting distribution of $T(\hat{\rho}_T - 1)$ is skewed to the left.

Recall that in the stationary case when $|\rho| < 1$, the estimate $\hat{\rho}_T$ is
downward-biased in small samples. Even so, in the stationary case the limiting distribution
of $\sqrt{T}(\hat{\rho}_T - \rho)$ is symmetric around zero. By contrast, when the true value of $\rho$
is unity, even the limiting distribution of $T(\hat{\rho}_T - 1)$ is asymmetric, with negative
values twice as likely as positive values.
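The figure of 0.68 can be reproduced in two ways. The sketch below, with assumed settings (T = 100, 4000 replications, Gaussian innovations, arbitrary seed, hypothetical function name), first evaluates the probability that a squared standard Normal is below one and then counts how often the OLS estimate falls below unity in simulated random walks. It relies on the fact that the denominator of [17.4.3] is positive, so the sign of the deviation is the sign of the numerator, which by [17.3.25] is that of the squared terminal value minus the sum of squared innovations.

```python
import math
import random

# P{chi-square(1) < 1} = P{|Z| < 1} for Z ~ N(0, 1), via the error function.
p_below_one = math.erf(1.0 / math.sqrt(2.0))

def rho_hat_below_one(T, rng):
    """Draw one Gaussian random walk of length T and report whether rho_hat < 1.

    rho_hat - 1 is negative exactly when sum(y_{t-1} u_t), which equals
    (y_T^2 - sum(u_t^2)) / 2 when y_0 = 0, is negative.
    """
    y_T, sum_sq = 0.0, 0.0
    for _ in range(T):
        u_t = rng.gauss(0.0, 1.0)
        y_T += u_t
        sum_sq += u_t * u_t
    return y_T * y_T < sum_sq

rng = random.Random(7)       # arbitrary seed
reps, T = 4000, 100
share_negative = sum(rho_hat_below_one(T, rng) for _ in range(reps)) / reps

print(round(p_below_one, 4))   # 0.6827
print(share_negative)          # expected to be close to the value above
```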

In practice, critical values for the random variable in [17.4.7] are found by
calculating the exact small-sample distribution of $T(\hat{\rho}_T - 1)$ for given $T$, assuming
that the innovations $\{u_t\}$ are Gaussian. This can be done either by Monte Carlo, as
in the critical values reported in Fuller (1976), or by using exact numerical
procedures described in Evans and Savin (1981). Sample percentiles for $T(\hat{\rho}_T - 1)$
are reported in the section labeled Case 1 in Table B.5 of Appendix B. For finite
$T$, these are exact only under the assumption of Gaussian innovations. As $T$
becomes large, these values also describe the asymptotic distribution for non-Gaussian
innovations.
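A Monte Carlo calculation of this kind can be sketched in a few lines. What follows is only an illustration of the idea, not a reconstruction of the procedure behind the published tables: the replication count, seed, sample size T = 100, and helper names are all assumptions, and the resulting percentiles carry simulation error.

```python
import random

def scaled_rho_deviation(T, rng):
    """One draw of T*(rho_hat - 1) for a Gaussian random walk with y_0 = 0."""
    y_prev = 0.0
    num = 0.0    # running sum of y_{t-1} * u_t
    den = 0.0    # running sum of y_{t-1}^2
    for _ in range(T):
        u_t = rng.gauss(0.0, 1.0)
        num += y_prev * u_t
        den += y_prev * y_prev
        y_prev += u_t
    return T * num / den     # uses rho_hat - 1 = num / den

def empirical_quantile(values, p):
    """Crude sample quantile: the order statistic at position int(p * n)."""
    ordered = sorted(values)
    return ordered[int(p * len(ordered))]

rng = random.Random(1994)    # arbitrary seed
T, reps = 100, 3000
draws = [scaled_rho_deviation(T, rng) for _ in range(reps)]

q05 = empirical_quantile(draws, 0.05)
q50 = empirical_quantile(draws, 0.50)
q95 = empirical_quantile(draws, 0.95)
print(q05, q50, q95)   # left tail is much longer than the right tail
```

With these settings the simulated 5th percentile should land in the neighborhood of -8, and the median should be negative, reflecting the left skew described above.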

It follows from [17.4.7] that $\hat{\rho}_T$ is a superconsistent estimate of the true value
($\rho = 1$). This is easily seen by dividing [17.4.3] by $\sqrt{T}$:

$$\sqrt{T}(\hat{\rho}_T - 1) = \frac{T^{-3/2} \sum_{t=1}^{T} y_{t-1}u_t}{T^{-2} \sum_{t=1}^{T} y_{t-1}^2}. \qquad [17.4.8]$$

From Proposition 17.1(b), the numerator in [17.4.8] converges to $T^{-1/2}(1/2)\sigma^2$ times
$(X - 1)$, where $X$ is a $\chi^2(1)$ random variable. Since a $\chi^2(1)$ variable has finite
variance, the variance of the numerator in [17.4.8] is of order $1/T$, meaning that
the numerator converges in probability to zero. Hence,

$$\sqrt{T}(\hat{\rho}_T - 1) \xrightarrow{p} 0.$$


Result [17.4.7] allows the point estimate $\hat{\rho}_T$ to be used by itself to test the
null hypothesis of a unit root, without needing to calculate its standard error.
Another popular statistic for testing the null hypothesis that $\rho = 1$ is based on the
usual OLS $t$ test of this hypothesis,

$$t_T = \frac{\hat{\rho}_T - 1}{\hat{\sigma}_{\hat{\rho}_T}}, \qquad [17.4.9]$$

where $\hat{\sigma}_{\hat{\rho}_T}$ is the usual OLS standard error for the estimated coefficient,

$$\hat{\sigma}_{\hat{\rho}_T} = \left\{ s_T^2 \div \sum_{t=1}^{T} y_{t-1}^2 \right\}^{1/2},$$

and $s_T^2$ denotes the OLS estimate of the residual variance:

$$s_T^2 = \sum_{t=1}^{T} (y_t - \hat{\rho}_T y_{t-1})^2 / (T - 1).$$

Although the $t$ statistic [17.4.9] is calculated in the usual way, it does not have a
limiting Gaussian distribution when the true process is characterized by $\rho = 1$. To
find the appropriate limiting distribution, note that [17.4.9] can equivalently be
expressed as

$$t_T = T(\hat{\rho}_T - 1)\left\{ T^{-2} \sum_{t=1}^{T} y_{t-1}^2 \right\}^{1/2} \div \{s_T^2\}^{1/2}, \qquad [17.4.10]$$

or, substituting from [17.4.3],

$$t_T = \frac{T^{-1} \sum_{t=1}^{T} y_{t-1}u_t}{\left\{ T^{-2} \sum_{t=1}^{T} y_{t-1}^2 \right\}^{1/2} \{s_T^2\}^{1/2}}. \qquad [17.4.11]$$

As in Section 8.2, consistency of $\hat{\rho}_T$ implies $s_T^2 \xrightarrow{p} \sigma^2$. It follows from [17.4.5] and
[17.4.6] that as $T \to \infty$,

$$t_T \xrightarrow{L} \frac{(1/2)\sigma^2\{[W(1)]^2 - 1\}}{\left\{ \sigma^2 \int_0^1 [W(r)]^2\,dr \right\}^{1/2} \{\sigma^2\}^{1/2}} = \frac{(1/2)\{[W(1)]^2 - 1\}}{\left\{ \int_0^1 [W(r)]^2\,dr \right\}^{1/2}}. \qquad [17.4.12]$$

Statistical tables for the distribution of [17.4.11] for various sample sizes $T$ are
reported in the section labeled Case 1 in Table B.6; again, the small-sample results
assume Gaussian innovations.

Example 17.3 
The following AR(1) process for the nominal three-month U.S. Treasury bill
rate was fitted by OLS regression to quarterly data, $t$ = 1947:II to 1989:I:

$$i_t = \underset{(0.010592)}{0.99694}\; i_{t-1}, \qquad [17.4.13]$$

with the standard error of $\hat{\rho}$ in parentheses. Here $T = 168$ and

$$T(\hat{\rho} - 1) = (168)(0.99694 - 1) = -0.51.$$

The distribution of this statistic was calculated in [17.4.7] under the assumption
that the true value of $\rho$ is unity. The null hypothesis is therefore that $\rho = 1$,
and the alternative is that $\rho < 1$. From Table B.5, in a sample of this size,
95% of the time when there really is a unit root, the statistic $T(\hat{\rho} - 1)$ will
be above $-7.9$. The observed value ($-0.51$) is well above this, and so the null
hypothesis is accepted at the 5% level and we should conclude that these data
might well be described by a random walk.

In order to have rejected the null hypothesis for a sample of this size,
the estimated autoregressive coefficient $\hat{\rho}$ would have to be less than 0.95:

$$168(0.95 - 1) = -8.4.$$

The OLS $t$ test of $H_0$: $\rho = 1$ is

$$t = (0.99694 - 1)/0.010592 = -0.29.$$

This is well above the 5% critical value from Table B.6 of $-1.95$, so the null
hypothesis that the Treasury bill rate follows a random walk is also accepted
by this test.
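The arithmetic of this example can be reproduced directly from the reported estimates. In the sketch below the coefficient, standard error, sample size, and the two critical values are the numbers quoted in the example; the function name and the decision labels are illustrative choices.

```python
def unit_root_decision(statistic, critical_value):
    """One-sided test against rho < 1: reject only if the statistic is below the critical value."""
    return "reject" if statistic < critical_value else "accept"

T = 168
rho_hat, std_err = 0.99694, 0.010592   # estimates reported in [17.4.13]

rho_stat = T * (rho_hat - 1.0)         # Dickey-Fuller rho statistic
t_stat = (rho_hat - 1.0) / std_err     # OLS t statistic

# Critical values quoted in the example: -7.9 (Table B.5) and -1.95 (Table B.6).
rho_decision = unit_root_decision(rho_stat, -7.9)
t_decision = unit_root_decision(t_stat, -1.95)

# Largest estimate that would still lead the rho test to reject at this sample size.
rho_threshold = 1.0 + (-7.9) / T

print(round(rho_stat, 2), rho_decision)   # -0.51 accept
print(round(t_stat, 2), t_decision)       # -0.29 accept
print(round(rho_threshold, 3))            # 0.953
```

The threshold of about 0.953 is consistent with the statement that the estimate would have to fall to roughly 0.95 before the null hypothesis could be rejected.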


The test statistics [17.4.7] and [17.4.12] are examples of the Dickey-Fuller test 
for unit roots, named for the general battery of tests proposed by Dickey and Fuller 
(1979). 


Case 2. Constant Term but No Time Trend Included 
in the Regression; True Process Is a Random Walk 


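For case 2, we continue to assume, as in case 1, that the data are generated
by a random walk:

$$y_t = y_{t-1} + u_t,$$

with $u_t$ i.i.d. with mean zero and variance $\sigma^2$. Although the true model is the same
as in case 1, suppose now that a constant term is included in the AR(1) specification
that is to be estimated by OLS:

$$y_t = \alpha + \rho y_{t-1} + u_t. \qquad [17.4.14]$$

The task now is to describe the properties of the OLS estimates,

$$\begin{bmatrix} \hat{\alpha}_T \\ \hat{\rho}_T \end{bmatrix}
= \begin{bmatrix} T & \sum y_{t-1} \\ \sum y_{t-1} & \sum y_{t-1}^2 \end{bmatrix}^{-1}
\begin{bmatrix} \sum y_t \\ \sum y_{t-1}y_t \end{bmatrix}, \qquad [17.4.15]$$

under the null hypothesis that $\alpha = 0$ and $\rho = 1$ (here $\sum$ indicates summation over
$t = 1, 2, \ldots, T$). Recall the familiar characterization in [8.2.3] of the deviation
of an estimated OLS coefficient vector ($\mathbf{b}_T$) from the true value ($\boldsymbol{\beta}$),

$$\mathbf{b}_T - \boldsymbol{\beta} = \left[ \sum_{t=1}^{T} \mathbf{x}_t\mathbf{x}_t' \right]^{-1} \left[ \sum_{t=1}^{T} \mathbf{x}_t u_t \right], \qquad [17.4.16]$$

or, in this case,

$$\begin{bmatrix} \hat{\alpha}_T \\ \hat{\rho}_T - 1 \end{bmatrix}
= \begin{bmatrix} T & \sum y_{t-1} \\ \sum y_{t-1} & \sum y_{t-1}^2 \end{bmatrix}^{-1}
\begin{bmatrix} \sum u_t \\ \sum y_{t-1}u_t \end{bmatrix}. \qquad [17.4.17]$$

As in case 1, $y_t$ has the same properties as the variable $\xi_t$ described in
Proposition 17.1 under the maintained hypothesis. Thus, result (d) of that proposition
establishes that the sum $\sum y_{t-1}$ must be divided by $T^{3/2}$ before obtaining a random
variable that converges in distribution:

$$T^{-3/2} \sum y_{t-1} \xrightarrow{L} \sigma\cdot\int_0^1 W(r)\,dr. \qquad [17.4.18]$$

In other words,

$$\sum y_{t-1} = O_p(T^{3/2}).$$

Similarly, results [17.4.5] and [17.4.6] establish that

$$\sum y_{t-1}u_t = O_p(T)$$
$$\sum y_{t-1}^2 = O_p(T^2),$$

and from Proposition 17.1(a),

$$\sum u_t = O_p(T^{1/2}).$$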



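Thus, the order in probability of the individual terms in [17.4.17] is as follows:

$$\begin{bmatrix} \hat{\alpha}_T \\ \hat{\rho}_T - 1 \end{bmatrix}
= \begin{bmatrix} O_p(T) & O_p(T^{3/2}) \\ O_p(T^{3/2}) & O_p(T^2) \end{bmatrix}^{-1}
\begin{bmatrix} O_p(T^{1/2}) \\ O_p(T) \end{bmatrix}. \qquad [17.4.19]$$

It is clear from [17.4.19] that the estimates $\hat{\alpha}_T$ and $\hat{\rho}_T$ have different rates
of convergence, and as in the previous chapter, a scaling matrix $\Upsilon_T$ is helpful in
describing their limiting distributions. Recall from [16.1.18] that this rescaling is
achieved by premultiplying [17.4.16] by $\Upsilon_T$ and writing the result as

$$\begin{aligned}
\Upsilon_T(\mathbf{b}_T - \boldsymbol{\beta})
&= \Upsilon_T \left[ \sum_{t=1}^{T} \mathbf{x}_t\mathbf{x}_t' \right]^{-1} \Upsilon_T \Upsilon_T^{-1} \left[ \sum_{t=1}^{T} \mathbf{x}_t u_t \right] \\
&= \left\{ \Upsilon_T^{-1} \left[ \sum_{t=1}^{T} \mathbf{x}_t\mathbf{x}_t' \right] \Upsilon_T^{-1} \right\}^{-1} \left\{ \Upsilon_T^{-1} \left[ \sum_{t=1}^{T} \mathbf{x}_t u_t \right] \right\}.
\end{aligned} \qquad [17.4.20]$$

From [17.4.19], for this application $\Upsilon_T$ should be specified to be the following
matrix:

$$\Upsilon_T \equiv \begin{bmatrix} T^{1/2} & 0 \\ 0 & T \end{bmatrix}, \qquad [17.4.21]$$

for which [17.4.20] becomes

$$\begin{bmatrix} T^{1/2} & 0 \\ 0 & T \end{bmatrix} \begin{bmatrix} \hat{\alpha}_T \\ \hat{\rho}_T - 1 \end{bmatrix}
= \left\{ \begin{bmatrix} T^{-1/2} & 0 \\ 0 & T^{-1} \end{bmatrix}
\begin{bmatrix} T & \sum y_{t-1} \\ \sum y_{t-1} & \sum y_{t-1}^2 \end{bmatrix}
\begin{bmatrix} T^{-1/2} & 0 \\ 0 & T^{-1} \end{bmatrix} \right\}^{-1}
\left\{ \begin{bmatrix} T^{-1/2} & 0 \\ 0 & T^{-1} \end{bmatrix}
\begin{bmatrix} \sum u_t \\ \sum y_{t-1}u_t \end{bmatrix} \right\}$$

or

$$\begin{bmatrix} T^{1/2}\hat{\alpha}_T \\ T(\hat{\rho}_T - 1) \end{bmatrix}
= \begin{bmatrix} 1 & T^{-3/2}\sum y_{t-1} \\ T^{-3/2}\sum y_{t-1} & T^{-2}\sum y_{t-1}^2 \end{bmatrix}^{-1}
\begin{bmatrix} T^{-1/2}\sum u_t \\ T^{-1}\sum y_{t-1}u_t \end{bmatrix}. \qquad [17.4.22]$$

Consider the first term on the right side of [17.4.22]. Results [17.4.6] and
[17.4.18] establish that

$$\begin{aligned}
\begin{bmatrix} 1 & T^{-3/2}\sum y_{t-1} \\ T^{-3/2}\sum y_{t-1} & T^{-2}\sum y_{t-1}^2 \end{bmatrix}
&\xrightarrow{L} \begin{bmatrix} 1 & \sigma\cdot\int W(r)\,dr \\ \sigma\cdot\int W(r)\,dr & \sigma^2\cdot\int [W(r)]^2\,dr \end{bmatrix} \\
&= \begin{bmatrix} 1 & 0 \\ 0 & \sigma \end{bmatrix}
\begin{bmatrix} 1 & \int W(r)\,dr \\ \int W(r)\,dr & \int [W(r)]^2\,dr \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ 0 & \sigma \end{bmatrix},
\end{aligned} \qquad [17.4.23]$$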


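where the integral sign denotes integration over $r$ from 0 to 1. Similarly, result (a)
of Proposition 17.1 along with [17.4.5] determines the asymptotic distribution of
the second term in [17.4.22]:

$$\begin{bmatrix} T^{-1/2}\sum u_t \\ T^{-1}\sum y_{t-1}u_t \end{bmatrix}
\xrightarrow{L} \begin{bmatrix} \sigma\cdot W(1) \\ (1/2)\sigma^2\{[W(1)]^2 - 1\} \end{bmatrix}
= \sigma \begin{bmatrix} 1 & 0 \\ 0 & \sigma \end{bmatrix}
\begin{bmatrix} W(1) \\ (1/2)\{[W(1)]^2 - 1\} \end{bmatrix}. \qquad [17.4.24]$$

Substituting [17.4.23] and [17.4.24] into [17.4.22] establishes

$$\begin{aligned}
\begin{bmatrix} T^{1/2}\hat{\alpha}_T \\ T(\hat{\rho}_T - 1) \end{bmatrix}
&\xrightarrow{L} \sigma\cdot \begin{bmatrix} 1 & 0 \\ 0 & \sigma \end{bmatrix}^{-1}
\begin{bmatrix} 1 & \int W(r)\,dr \\ \int W(r)\,dr & \int [W(r)]^2\,dr \end{bmatrix}^{-1}
\begin{bmatrix} 1 & 0 \\ 0 & \sigma \end{bmatrix}^{-1}
\begin{bmatrix} 1 & 0 \\ 0 & \sigma \end{bmatrix}
\begin{bmatrix} W(1) \\ (1/2)\{[W(1)]^2 - 1\} \end{bmatrix} \\
&= \begin{bmatrix} \sigma & 0 \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & \int W(r)\,dr \\ \int W(r)\,dr & \int [W(r)]^2\,dr \end{bmatrix}^{-1}
\begin{bmatrix} W(1) \\ (1/2)\{[W(1)]^2 - 1\} \end{bmatrix}.
\end{aligned} \qquad [17.4.25]$$

Notice that

$$\begin{bmatrix} 1 & \int W(r)\,dr \\ \int W(r)\,dr & \int [W(r)]^2\,dr \end{bmatrix}^{-1}
= \Delta^{-1} \begin{bmatrix} \int [W(r)]^2\,dr & -\int W(r)\,dr \\ -\int W(r)\,dr & 1 \end{bmatrix}, \qquad [17.4.26]$$

where

$$\Delta \equiv \int [W(r)]^2\,dr - \left[ \int W(r)\,dr \right]^2. \qquad [17.4.27]$$

Thus, the second element in the vector expression in [17.4.25] states that

$$T(\hat{\rho}_T - 1) \xrightarrow{L} \frac{\tfrac{1}{2}\{[W(1)]^2 - 1\} - W(1)\cdot\int W(r)\,dr}{\int [W(r)]^2\,dr - \left[ \int W(r)\,dr \right]^2}. \qquad [17.4.28]$$

Neither estimate $\hat{\alpha}_T$ nor $\hat{\rho}_T$ has a limiting Gaussian distribution. Moreover,
the asymptotic distribution of the estimate of $\rho$ in [17.4.28] is not the same as the
asymptotic distribution in [17.4.7]: when a constant term is included in the
regression, a different table of critical values must be used.

The second section of Table B.5 records percentiles for the distribution of
$T(\hat{\rho}_T - 1)$ for case 2. As in case 1, the calculations assume Gaussian innovations,
though as $T$ becomes large, these are valid for non-Gaussian innovations as well.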


Notice that this distribution is even more strongly skewed than that for case 1, so
that when a constant term is included in the regression, the estimated coefficient
on $y_{t-1}$ must be farther from unity in order to reject the null hypothesis of a unit
root. Indeed, for $T > 25$, 95% of the time the estimated value $\hat{\rho}_T$ will be less than
unity. For example, if the estimated value $\hat{\rho}_T$ is 0.999 in a sample of size $T = 100$,
the null hypothesis of $\rho = 1$ would be rejected in favor of the alternative that
$\rho > 1$! If the true value of $\rho$ is unity, we would not expect to obtain an estimate
as large as 0.999.
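The stronger skew in case 2 is easy to see in a small simulation. The sketch below is illustrative only: the seed, T = 100, and 2000 replications are assumptions, and the OLS estimates are computed in deviation-from-mean form, which is algebraically equivalent to solving the normal equations in [17.4.15].

```python
import random

def ols_with_constant(y):
    """OLS of y_t on a constant and y_{t-1}; y = [y_0, ..., y_T]. Returns (alpha_hat, rho_hat)."""
    x, z = y[:-1], y[1:]              # regressor y_{t-1} and dependent variable y_t
    T = len(x)
    x_bar, z_bar = sum(x) / T, sum(z) / T
    sxx = sum((a - x_bar) ** 2 for a in x)
    sxz = sum((a - x_bar) * (b - z_bar) for a, b in zip(x, z))
    rho = sxz / sxx
    return z_bar - rho * x_bar, rho

def simulate_case2(T, rng):
    """One draw of T*(rho_hat - 1) when the data are a driftless Gaussian random walk."""
    y = [0.0]
    for _ in range(T):
        y.append(y[-1] + rng.gauss(0.0, 1.0))
    _, rho = ols_with_constant(y)
    return T * (rho - 1.0)

alpha_demo, rho_demo = ols_with_constant([0.0, 1.0, 3.0, 2.0, 4.0])   # small hand-checkable case

rng = random.Random(314)              # arbitrary seed
T, reps = 100, 2000
draws = sorted(simulate_case2(T, rng) for _ in range(reps))
share_below_one = sum(d < 0.0 for d in draws) / reps
q05 = draws[int(0.05 * reps)]

print(alpha_demo, rho_demo)           # approximately 1.9 and 0.4
print(share_below_one, q05)           # share is high; 5th percentile is far below the case 1 value
```

The large share of estimates below unity is why an estimate as close to one as 0.999 can sit in the upper tail of the null distribution.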

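Dickey and Fuller also proposed an alternative test based on the OLS $t$ test
of the null hypothesis that $\rho = 1$:

$$t_T = \frac{\hat{\rho}_T - 1}{\hat{\sigma}_{\hat{\rho}_T}}, \qquad [17.4.29]$$

where

$$\hat{\sigma}_{\hat{\rho}_T}^2 = s_T^2 \begin{bmatrix} 0 & 1 \end{bmatrix}
\begin{bmatrix} T & \sum y_{t-1} \\ \sum y_{t-1} & \sum y_{t-1}^2 \end{bmatrix}^{-1}
\begin{bmatrix} 0 \\ 1 \end{bmatrix} \qquad [17.4.30]$$

$$s_T^2 = (T - 2)^{-1} \sum_{t=1}^{T} (y_t - \hat{\alpha}_T - \hat{\rho}_T y_{t-1})^2.$$

Notice that if both sides of [17.4.30] are multiplied by $T^2$, the result can be written
as

$$\begin{aligned}
T^2\cdot\hat{\sigma}_{\hat{\rho}_T}^2
&= s_T^2 \begin{bmatrix} 0 & T \end{bmatrix}
\begin{bmatrix} T & \sum y_{t-1} \\ \sum y_{t-1} & \sum y_{t-1}^2 \end{bmatrix}^{-1}
\begin{bmatrix} 0 \\ T \end{bmatrix} \\
&= s_T^2 \begin{bmatrix} 0 & 1 \end{bmatrix} \Upsilon_T
\begin{bmatrix} T & \sum y_{t-1} \\ \sum y_{t-1} & \sum y_{t-1}^2 \end{bmatrix}^{-1}
\Upsilon_T \begin{bmatrix} 0 \\ 1 \end{bmatrix}
\end{aligned} \qquad [17.4.31]$$

for $\Upsilon_T$ the matrix in [17.4.21]. Recall from [17.4.23] that

$$\begin{aligned}
\Upsilon_T \begin{bmatrix} T & \sum y_{t-1} \\ \sum y_{t-1} & \sum y_{t-1}^2 \end{bmatrix}^{-1} \Upsilon_T
&= \left\{ \Upsilon_T^{-1} \begin{bmatrix} T & \sum y_{t-1} \\ \sum y_{t-1} & \sum y_{t-1}^2 \end{bmatrix} \Upsilon_T^{-1} \right\}^{-1} \\
&= \begin{bmatrix} 1 & T^{-3/2}\sum y_{t-1} \\ T^{-3/2}\sum y_{t-1} & T^{-2}\sum y_{t-1}^2 \end{bmatrix}^{-1} \\
&\xrightarrow{L} \begin{bmatrix} 1 & 0 \\ 0 & \sigma \end{bmatrix}^{-1}
\begin{bmatrix} 1 & \int W(r)\,dr \\ \int W(r)\,dr & \int [W(r)]^2\,dr \end{bmatrix}^{-1}
\begin{bmatrix} 1 & 0 \\ 0 & \sigma \end{bmatrix}^{-1}.
\end{aligned} \qquad [17.4.32]$$

Thus, from [17.4.31],

$$T^2\cdot\hat{\sigma}_{\hat{\rho}_T}^2 \xrightarrow{L} s_T^2 \begin{bmatrix} 0 & \sigma^{-1} \end{bmatrix}
\begin{bmatrix} 1 & \int W(r)\,dr \\ \int W(r)\,dr & \int [W(r)]^2\,dr \end{bmatrix}^{-1}
\begin{bmatrix} 0 \\ \sigma^{-1} \end{bmatrix}. \qquad [17.4.33]$$

It is also easy to show that

$$s_T^2 \xrightarrow{p} \sigma^2, \qquad [17.4.34]$$

from which [17.4.33] becomes

$$\begin{aligned}
T^2\cdot\hat{\sigma}_{\hat{\rho}_T}^2
&\xrightarrow{L} \begin{bmatrix} 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & \int W(r)\,dr \\ \int W(r)\,dr & \int [W(r)]^2\,dr \end{bmatrix}^{-1}
\begin{bmatrix} 0 \\ 1 \end{bmatrix} \\
&= \frac{1}{\int [W(r)]^2\,dr - \left[ \int W(r)\,dr \right]^2}.
\end{aligned} \qquad [17.4.35]$$

Thus, the asymptotic distribution of the OLS $t$ test in [17.4.29] is

$$\begin{aligned}
t_T = \frac{T(\hat{\rho}_T - 1)}{\{T^2\cdot\hat{\sigma}_{\hat{\rho}_T}^2\}^{1/2}}
&\xrightarrow{p} T(\hat{\rho}_T - 1) \times \left\{ \int [W(r)]^2\,dr - \left[ \int W(r)\,dr \right]^2 \right\}^{1/2} \\
&\xrightarrow{L} \frac{\tfrac{1}{2}\{[W(1)]^2 - 1\} - W(1)\cdot\int W(r)\,dr}{\left\{ \int [W(r)]^2\,dr - \left[ \int W(r)\,dr \right]^2 \right\}^{1/2}}.
\end{aligned} \qquad [17.4.36]$$

Sample percentiles for the OLS $t$ test of $\rho = 1$ are reported for case 2 in the
second section of Table B.6. As $T$ grows large, these approach the distribution in
the last line of [17.4.36].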


Example 17.4 
When a constant term is included in the estimated autoregression for the
interest rate data from Example 17.3, the result is

\[
i_t = \underset{(0.112)}{0.211} + \underset{(0.019133)}{0.96691}\, i_{t-1}, \tag*{[17.4.37]}
\]

with standard errors reported in parentheses. The Dickey-Fuller test based on
the estimated value of $\rho$ for this specification is

\[
T(\hat{\rho} - 1) = (168)(0.96691 - 1) = -5.56.
\]

From Table B.5, the 5% critical value is found by interpolation to be $-13.8$.
Since $-5.56 > -13.8$, the null hypothesis of a unit root ($\rho = 1$) is accepted
at the 5% level based on the Dickey-Fuller $\rho$ test. The OLS $t$ statistic is

\[
(0.96691 - 1)/0.019133 = -1.73,
\]

which from Table B.6 is to be compared with $-2.89$. Since $-1.73 > -2.89$,
the null hypothesis of a unit root is again accepted.
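
The arithmetic of Example 17.4 can be checked with a few lines of Python. This is a sketch only: the helper names `df_rho_stat`, `df_t_stat` and `unit_root_accepted` are hypothetical, and the inputs are the estimates and critical values quoted in the example.

```python
def df_rho_stat(T, rho_hat):
    """Dickey-Fuller rho statistic T * (rho_hat - 1)."""
    return T * (rho_hat - 1.0)


def df_t_stat(rho_hat, se_rho):
    """OLS t statistic for the hypothesis rho = 1."""
    return (rho_hat - 1.0) / se_rho


def unit_root_accepted(stat, critical_value):
    """Lower-tail test: the unit root null stands if stat exceeds the critical value."""
    return stat > critical_value


T, rho_hat, se_rho = 168, 0.96691, 0.019133   # values reported in [17.4.37]

rho_stat = df_rho_stat(T, rho_hat)
t_stat = df_t_stat(rho_hat, se_rho)
print(round(rho_stat, 2), unit_root_accepted(rho_stat, -13.8))
print(round(t_stat, 2), unit_root_accepted(t_stat, -2.89))
```

Both comparisons are one-sided: the alternative of interest is $\rho < 1$, so only large negative values of either statistic count as evidence against the unit root.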


These statistics test the null hypothesis that $\rho = 1$. However, a maintained
assumption on which the derivation of [17.4.25] was based is that the true value
of $\alpha$ is zero. Thus, it might seem more natural to test for a unit root in this
specification by testing the joint hypothesis that $\alpha = 0$ and $\rho = 1$. Dickey and
Fuller (1981) used Monte Carlo to calculate the distribution of the Wald form of
the OLS $F$ test of this hypothesis (expression [8.1.32] or [8.1.37]). Their values
are reported under the heading Case 2 in Table B.7.


Example 17.5 
The OLS Wald $F$ statistic for testing the joint hypothesis that $\alpha = 0$ and $\rho = 1$
for the regression in [17.4.37] is 1.81. Under the classical regression assumptions,
this would have an $F(2, 166)$ distribution. In this case, however, the
usual statistic is to be compared with the values under Case 2 in Table B.7,
for which the 5% critical value is found by interpolation to be 4.67. Since
$1.81 < 4.67$, the joint null hypothesis that $\alpha = 0$ and $\rho = 1$ is accepted at the 5%
level.


Case 3. Constant Term but No Time Trend Included 
in the Regression; True Process Is Random Walk with Drift 


In case 3, the same regression [17.4.14] is estimated as in case 2, though now
it is supposed that the true process is a random walk with drift:

\[
y_t = \alpha + y_{t-1} + u_t, \tag*{[17.4.38]}
\]

where the true value of $\alpha$ is not zero. Although this might seem like a minor change,
it has a radical effect on the asymptotic distribution of $\hat{\alpha}$ and $\hat{\rho}$. To see why, note
that [17.4.38] implies that

\[
y_t = y_0 + \alpha t + (u_1 + u_2 + \cdots + u_t) = y_0 + \alpha t + \xi_t, \tag*{[17.4.39]}
\]

where

\[
\xi_t \equiv u_1 + u_2 + \cdots + u_t \qquad \text{for } t = 1, 2, \ldots, T
\]

with $\xi_0 \equiv 0$.
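Consider the behavior of the sum

\[
\sum_{t=1}^{T} y_{t-1} = \sum_{t=1}^{T} [y_0 + \alpha(t - 1) + \xi_{t-1}]. \tag*{[17.4.40]}
\]

The first term in [17.4.40] is just $T y_0$, and if this is divided by $T$, the result will be
a fixed value. The second term, $\sum \alpha(t - 1)$, must be divided by $T^2$ in order to
converge:

\[
T^{-2} \sum_{t=1}^{T} \alpha(t - 1) \to \alpha/2,
\]

by virtue of Proposition 17.1(h). The third term converges when divided by $T^{3/2}$:

\[
T^{-3/2} \sum_{t=1}^{T} \xi_{t-1} \xrightarrow{L} \sigma \int_0^1 W(r)\,dr,
\]

from Proposition 17.1(d). The order in probability of the three individual terms in
[17.4.40] is thus

\[
\sum_{t=1}^{T} y_{t-1} = \underbrace{\sum_{t=1}^{T} y_0}_{O_p(T)}
+ \underbrace{\sum_{t=1}^{T} \alpha(t - 1)}_{O_p(T^2)}
+ \underbrace{\sum_{t=1}^{T} \xi_{t-1}}_{O_p(T^{3/2})}.
\]

The time trend $\alpha(t - 1)$ asymptotically dominates the other two components:

\[
\begin{aligned}
T^{-2} \sum_{t=1}^{T} y_{t-1}
&= T^{-1} y_0 + T^{-2} \sum_{t=1}^{T} \alpha(t - 1) + T^{-1/2} \left\{ T^{-3/2} \sum_{t=1}^{T} \xi_{t-1} \right\} \\
&\xrightarrow{p} 0 + \alpha/2 + 0.
\end{aligned} \tag*{[17.4.41]}
\]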




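Similarly, we have that

\[
\begin{aligned}
\sum_{t=1}^{T} y_{t-1}^2 &= \sum_{t=1}^{T} [y_0 + \alpha(t - 1) + \xi_{t-1}]^2 \\
&= \underbrace{\sum_{t=1}^{T} y_0^2}_{O_p(T)}
+ \underbrace{\sum_{t=1}^{T} \alpha^2 (t - 1)^2}_{O_p(T^3)}
+ \underbrace{\sum_{t=1}^{T} \xi_{t-1}^2}_{O_p(T^2)} \\
&\quad + \underbrace{\sum_{t=1}^{T} 2 y_0 \alpha (t - 1)}_{O_p(T^2)}
+ \underbrace{\sum_{t=1}^{T} 2 y_0 \xi_{t-1}}_{O_p(T^{3/2})}
+ \underbrace{\sum_{t=1}^{T} 2 \alpha (t - 1) \xi_{t-1}}_{O_p(T^{5/2})}.
\end{aligned}
\]

When divided by $T^3$, the only term that does not vanish asymptotically is that due
to the time trend $\alpha^2 (t - 1)^2$:

\[
T^{-3} \sum_{t=1}^{T} y_{t-1}^2 \xrightarrow{p} \alpha^2/3. \tag*{[17.4.42]}
\]

Finally, observe that

\[
\begin{aligned}
\sum_{t=1}^{T} y_{t-1} u_t &= \sum_{t=1}^{T} [y_0 + \alpha(t - 1) + \xi_{t-1}] u_t \\
&= \underbrace{y_0 \sum_{t=1}^{T} u_t}_{O_p(T^{1/2})}
+ \underbrace{\sum_{t=1}^{T} \alpha (t - 1) u_t}_{O_p(T^{3/2})}
+ \underbrace{\sum_{t=1}^{T} \xi_{t-1} u_t}_{O_p(T)},
\end{aligned}
\]

from which

\[
T^{-3/2} \sum_{t=1}^{T} y_{t-1} u_t \xrightarrow{p} T^{-3/2} \sum_{t=1}^{T} \alpha (t - 1) u_t. \tag*{[17.4.43]}
\]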


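Results [17.4.41] through [17.4.43] imply that when the true process is a
random walk with drift, the estimated OLS coefficients in [17.4.15] satisfy

\[
\begin{bmatrix} \hat{\alpha}_T - \alpha \\ \hat{\rho}_T - 1 \end{bmatrix}
= \begin{bmatrix} O_p(T) & O_p(T^2) \\ O_p(T^2) & O_p(T^3) \end{bmatrix}^{-1}
\begin{bmatrix} O_p(T^{1/2}) \\ O_p(T^{3/2}) \end{bmatrix}.
\]

Thus, for this case, the Sims, Stock, and Watson scaling matrix would be

\[
\Upsilon_T = \begin{bmatrix} T^{1/2} & 0 \\ 0 & T^{3/2} \end{bmatrix},
\]

for which [17.4.20] becomes

\[
\begin{bmatrix} T^{1/2} & 0 \\ 0 & T^{3/2} \end{bmatrix}
\begin{bmatrix} \hat{\alpha}_T - \alpha \\ \hat{\rho}_T - 1 \end{bmatrix}
= \left\{ \begin{bmatrix} T^{-1/2} & 0 \\ 0 & T^{-3/2} \end{bmatrix}
\begin{bmatrix} T & \sum y_{t-1} \\ \sum y_{t-1} & \sum y_{t-1}^2 \end{bmatrix}
\begin{bmatrix} T^{-1/2} & 0 \\ 0 & T^{-3/2} \end{bmatrix} \right\}^{-1}
\left\{ \begin{bmatrix} T^{-1/2} & 0 \\ 0 & T^{-3/2} \end{bmatrix}
\begin{bmatrix} \sum u_t \\ \sum y_{t-1} u_t \end{bmatrix} \right\}
\]

or

\[
\begin{bmatrix} T^{1/2}(\hat{\alpha}_T - \alpha) \\ T^{3/2}(\hat{\rho}_T - 1) \end{bmatrix}
= \begin{bmatrix} 1 & T^{-2}\sum y_{t-1} \\ T^{-2}\sum y_{t-1} & T^{-3}\sum y_{t-1}^2 \end{bmatrix}^{-1}
\begin{bmatrix} T^{-1/2}\sum u_t \\ T^{-3/2}\sum y_{t-1} u_t \end{bmatrix}. \tag*{[17.4.44]}
\]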
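From [17.4.41] and [17.4.42], the first term in [17.4.44] converges to

\[
\begin{bmatrix} 1 & T^{-2}\sum y_{t-1} \\ T^{-2}\sum y_{t-1} & T^{-3}\sum y_{t-1}^2 \end{bmatrix}
\xrightarrow{p}
\begin{bmatrix} 1 & \alpha/2 \\ \alpha/2 & \alpha^2/3 \end{bmatrix} \equiv Q. \tag*{[17.4.45]}
\]

From [17.4.43] and [17.3.18], the second term in [17.4.44] satisfies

\[
\begin{bmatrix} T^{-1/2}\sum u_t \\ T^{-3/2}\sum y_{t-1} u_t \end{bmatrix}
\xrightarrow{p}
\begin{bmatrix} T^{-1/2}\sum u_t \\ T^{-3/2}\sum \alpha(t - 1) u_t \end{bmatrix}
\xrightarrow{L}
N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix},\;
\sigma^2 \begin{bmatrix} 1 & \alpha/2 \\ \alpha/2 & \alpha^2/3 \end{bmatrix} \right)
= N(0, \sigma^2 Q). \tag*{[17.4.46]}
\]

Combining [17.4.44] through [17.4.46], it follows that

\[
\begin{bmatrix} T^{1/2}(\hat{\alpha}_T - \alpha) \\ T^{3/2}(\hat{\rho}_T - 1) \end{bmatrix}
\xrightarrow{L} N(0,\; Q^{-1} \cdot \sigma^2 Q \cdot Q^{-1}) = N(0,\; \sigma^2 Q^{-1}). \tag*{[17.4.47]}
\]

Thus, for case 3, both estimated coefficients are asymptotically Gaussian. In
fact, the asymptotic properties of $\hat{\alpha}_T$ and $\hat{\rho}_T$ are exactly the same as those for $\hat{\alpha}_T$
and $\hat{\delta}_T$ in the deterministic time trend regression analyzed in Chapter 16. The
reason for this correspondence is very simple: the regressor $y_{t-1}$ is asymptotically
dominated by the time trend $\alpha \cdot (t - 1)$. In large samples, it is as if the explanatory
variable $y_{t-1}$ were replaced by the time trend $\alpha \cdot (t - 1)$. Recalling the analysis of
Section 16.2, it follows that for case 3, the standard OLS $t$ and $F$ statistics can be
calculated in the usual way and compared with the standard tables (Tables B.3 and
B.4, respectively).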
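
The entries of $Q$ are simply the limits of scaled sums of the trend $\alpha(t - 1)$, which is easy to confirm numerically. The following Python sketch is illustrative only; the function names `scaled_trend_moments` and `q_inverse`, and the particular values $\alpha = 2$ and $T = 100{,}000$, are assumptions made for the example.

```python
def scaled_trend_moments(alpha, T):
    """Return T^{-2} * sum alpha(t-1) and T^{-3} * sum [alpha(t-1)]^2 for t = 1..T."""
    first = sum(alpha * (t - 1) for t in range(1, T + 1)) / T**2
    second = sum((alpha * (t - 1)) ** 2 for t in range(1, T + 1)) / T**3
    return first, second


def q_inverse(alpha):
    """Inverse of Q = [[1, alpha/2], [alpha/2, alpha^2/3]] from [17.4.45]."""
    a, b, d = 1.0, alpha / 2.0, alpha**2 / 3.0
    det = a * d - b * b                      # equals alpha^2 / 12
    return [[d / det, -b / det], [-b / det, a / det]]


alpha = 2.0
m1, m2 = scaled_trend_moments(alpha, 100_000)
print(m1, alpha / 2)          # first moment approaches alpha / 2
print(m2, alpha**2 / 3)       # second moment approaches alpha^2 / 3

Qinv = q_inverse(alpha)
print(Qinv[1][1], 12 / alpha**2)   # lower-right element of Q^{-1}
```

Since the determinant of $Q$ is $\alpha^2/12$, the lower-right element of $Q^{-1}$ is $12/\alpha^2$, so [17.4.47] implies that $T^{3/2}(\hat{\rho}_T - 1)$ has limiting variance $12\sigma^2/\alpha^2$: the larger the drift, the more precisely $\rho$ is estimated.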


Case 4. Constant Term and Time Trend Included 
in the Regression; True Process Is Random Walk 
With or Without Drift 


Suppose, as in the previous case, that the true model is

\[
y_t = \alpha + y_{t-1} + u_t,
\]

where $u_t$ is i.i.d. with mean zero and variance $\sigma^2$. For this case, the true value of
$\alpha$ turns out not to matter for the asymptotic distribution. In contrast to the previous
case, we now assume that a time trend is included in the regression that is actually
estimated by OLS:

\[
y_t = \alpha + \rho y_{t-1} + \delta t + u_t. \tag*{[17.4.48]}
\]

If $\alpha \neq 0$, $y_{t-1}$ would be asymptotically equivalent to a time trend. Since a time
trend is already included as a separate variable in the regression, this would make
the explanatory variables collinear in large samples. Describing the asymptotic
distribution of the estimates therefore requires not just a rescaling of variables but
also a rotation of the kind introduced in Section 16.3.
Note that the regression model of [17.4.48] can equivalently be written as

\[
\begin{aligned}
y_t &= (1 - \rho)\alpha + \rho [y_{t-1} - \alpha(t - 1)] + (\delta + \rho\alpha) t + u_t \\
&\equiv \alpha^* + \rho^* \xi_{t-1} + \delta^* t + u_t,
\end{aligned} \tag*{[17.4.49]}
\]

where $\alpha^* \equiv (1 - \rho)\alpha$, $\rho^* \equiv \rho$, $\delta^* \equiv (\delta + \rho\alpha)$, and $\xi_t \equiv y_t - \alpha t$. Moreover, under
the null hypothesis that $\rho = 1$ and $\delta = 0$,

\[
\xi_t = y_0 + u_1 + u_2 + \cdots + u_t;
\]

that is, $\xi_t$ is the random walk described in Proposition 17.1. Consider, as in Section
16.3, a hypothetical regression of $y_t$ on a constant, $\xi_{t-1}$, and a time trend, producing
the OLS estimates

\[
\begin{bmatrix} \hat{\alpha}_T^* \\ \hat{\rho}_T^* \\ \hat{\delta}_T^* \end{bmatrix}
= \begin{bmatrix}
T & \sum \xi_{t-1} & \sum t \\
\sum \xi_{t-1} & \sum \xi_{t-1}^2 & \sum \xi_{t-1} t \\
\sum t & \sum t \xi_{t-1} & \sum t^2
\end{bmatrix}^{-1}
\begin{bmatrix} \sum y_t \\ \sum \xi_{t-1} y_t \\ \sum t y_t \end{bmatrix}. \tag*{[17.4.50]}
\]

The maintained hypothesis is that $\alpha = \alpha_0$, $\rho = 1$, and $\delta = 0$, which in the
transformed system would mean $\alpha^* = 0$, $\rho^* = 1$, and $\delta^* = \alpha_0$. The deviations of
the OLS estimates from these true values are given by

\[
\begin{bmatrix} \hat{\alpha}_T^* \\ \hat{\rho}_T^* - 1 \\ \hat{\delta}_T^* - \alpha_0 \end{bmatrix}
= \begin{bmatrix}
T & \sum \xi_{t-1} & \sum t \\
\sum \xi_{t-1} & \sum \xi_{t-1}^2 & \sum \xi_{t-1} t \\
\sum t & \sum t \xi_{t-1} & \sum t^2
\end{bmatrix}^{-1}
\begin{bmatrix} \sum u_t \\ \sum \xi_{t-1} u_t \\ \sum t u_t \end{bmatrix}. \tag*{[17.4.51]}
\]


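Consulting the rates of convergence in Proposition 17.1, in this case the scaling
matrix should be

\[
\Upsilon_T = \begin{bmatrix} T^{1/2} & 0 & 0 \\ 0 & T & 0 \\ 0 & 0 & T^{3/2} \end{bmatrix},
\]

and [17.4.20] would be

\[
\begin{aligned}
\begin{bmatrix} T^{1/2} & 0 & 0 \\ 0 & T & 0 \\ 0 & 0 & T^{3/2} \end{bmatrix}
\begin{bmatrix} \hat{\alpha}_T^* \\ \hat{\rho}_T^* - 1 \\ \hat{\delta}_T^* - \alpha_0 \end{bmatrix}
&= \left\{ \Upsilon_T^{-1}
\begin{bmatrix}
T & \sum \xi_{t-1} & \sum t \\
\sum \xi_{t-1} & \sum \xi_{t-1}^2 & \sum \xi_{t-1} t \\
\sum t & \sum t \xi_{t-1} & \sum t^2
\end{bmatrix}
\Upsilon_T^{-1} \right\}^{-1}
\left\{ \Upsilon_T^{-1}
\begin{bmatrix} \sum u_t \\ \sum \xi_{t-1} u_t \\ \sum t u_t \end{bmatrix} \right\}
\end{aligned}
\]

or

\[
\begin{bmatrix} T^{1/2} \hat{\alpha}_T^* \\ T(\hat{\rho}_T^* - 1) \\ T^{3/2}(\hat{\delta}_T^* - \alpha_0) \end{bmatrix}
= \begin{bmatrix}
1 & T^{-3/2}\sum \xi_{t-1} & T^{-2}\sum t \\
T^{-3/2}\sum \xi_{t-1} & T^{-2}\sum \xi_{t-1}^2 & T^{-5/2}\sum \xi_{t-1} t \\
T^{-2}\sum t & T^{-5/2}\sum t \xi_{t-1} & T^{-3}\sum t^2
\end{bmatrix}^{-1}
\begin{bmatrix} T^{-1/2}\sum u_t \\ T^{-1}\sum \xi_{t-1} u_t \\ T^{-3/2}\sum t u_t \end{bmatrix}. \tag*{[17.4.52]}
\]

The asymptotic distribution can then be found from Proposition 17.1:

\[
\begin{aligned}
\begin{bmatrix} T^{1/2} \hat{\alpha}_T^* \\ T(\hat{\rho}_T^* - 1) \\ T^{3/2}(\hat{\delta}_T^* - \alpha_0) \end{bmatrix}
&\xrightarrow{L}
\begin{bmatrix}
1 & \sigma \int_0^1 W(r)\,dr & \tfrac{1}{2} \\
\sigma \int_0^1 W(r)\,dr & \sigma^2 \int_0^1 [W(r)]^2\,dr & \sigma \int_0^1 r W(r)\,dr \\
\tfrac{1}{2} & \sigma \int_0^1 r W(r)\,dr & \tfrac{1}{3}
\end{bmatrix}^{-1}
\begin{bmatrix}
\sigma W(1) \\
\tfrac{1}{2} \sigma^2 \{[W(1)]^2 - 1\} \\
\sigma \left\{ W(1) - \int_0^1 W(r)\,dr \right\}
\end{bmatrix} \\
&= \sigma
\begin{bmatrix} 1 & 0 & 0 \\ 0 & \sigma & 0 \\ 0 & 0 & 1 \end{bmatrix}^{-1}
\begin{bmatrix}
1 & \int_0^1 W(r)\,dr & \tfrac{1}{2} \\
\int_0^1 W(r)\,dr & \int_0^1 [W(r)]^2\,dr & \int_0^1 r W(r)\,dr \\
\tfrac{1}{2} & \int_0^1 r W(r)\,dr & \tfrac{1}{3}
\end{bmatrix}^{-1}
\begin{bmatrix} 1 & 0 & 0 \\ 0 & \sigma & 0 \\ 0 & 0 & 1 \end{bmatrix}^{-1} \\
&\qquad \times
\begin{bmatrix} 1 & 0 & 0 \\ 0 & \sigma & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix}
W(1) \\
\tfrac{1}{2} \{[W(1)]^2 - 1\} \\
W(1) - \int_0^1 W(r)\,dr
\end{bmatrix} \\
&= \begin{bmatrix} \sigma & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & \sigma \end{bmatrix}
\begin{bmatrix}
1 & \int_0^1 W(r)\,dr & \tfrac{1}{2} \\
\int_0^1 W(r)\,dr & \int_0^1 [W(r)]^2\,dr & \int_0^1 r W(r)\,dr \\
\tfrac{1}{2} & \int_0^1 r W(r)\,dr & \tfrac{1}{3}
\end{bmatrix}^{-1}
\begin{bmatrix}
W(1) \\
\tfrac{1}{2} \{[W(1)]^2 - 1\} \\
W(1) - \int_0^1 W(r)\,dr
\end{bmatrix}.
\end{aligned} \tag*{[17.4.53]}
\]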




Note that $\hat{\rho}_T^*$, the OLS estimate of $\rho$ based on [17.4.49], is identical to $\hat{\rho}_T$,
the OLS estimate of $\rho$ based on [17.4.48]. Thus, the asymptotic distribution of
$T(\hat{\rho}_T - 1)$ is given by the middle row of [17.4.53]. Note that this distribution does
not depend on either $\sigma$ or $\alpha$; in particular, it does not matter whether or not the
true value of $\alpha$ is zero.
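
The equality of the two estimates of $\rho$ is an algebraic property of OLS and can be checked on simulated data. The following Python sketch is an illustration under stated assumptions: the helper `ols` solves the normal equations by Gaussian elimination, `compare_regressions` is a hypothetical name, and the drift, sample size, and seed are arbitrary choices. The rotated regressor uses the true drift $\alpha_0$, which is known here only because the data are simulated.

```python
import random


def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gaussian elimination."""
    k = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    for i in range(k):
        # Partial pivoting for numerical stability.
        p = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        b[i], b[p] = b[p], b[i]
        for r in range(i + 1, k):
            f = A[r][i] / A[i][i]
            for c in range(i, k):
                A[r][c] -= f * A[i][c]
            b[r] -= f * b[i]
    coef = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(A[i][c] * coef[c] for c in range(i + 1, k))
        coef[i] = (b[i] - tail) / A[i][i]
    return coef


def compare_regressions(alpha0=0.5, T=200, seed=1):
    """Fit [17.4.48] and the rotated form [17.4.49] to one simulated sample."""
    rng = random.Random(seed)
    y = [0.0]
    for _ in range(T):
        y.append(alpha0 + y[-1] + rng.gauss(0.0, 1.0))   # random walk with drift
    target = y[1:]                                        # y_t for t = 1, ..., T
    X_orig = [[1.0, y[t - 1], float(t)] for t in range(1, T + 1)]
    # Rotated regressor: xi_{t-1} = y_{t-1} - alpha0 * (t - 1).
    X_rot = [[1.0, y[t - 1] - alpha0 * (t - 1), float(t)] for t in range(1, T + 1)]
    return ols(X_orig, target), ols(X_rot, target)


alpha0 = 0.5
(a, r, d), (a_star, r_star, d_star) = compare_regressions(alpha0=alpha0)
print(r, r_star)                      # the two estimates of rho coincide
print(d_star, d + r * alpha0)         # delta* = delta + rho * alpha
```

The other coefficients are related exactly as in [17.4.49]: the trend coefficient satisfies $\hat{\delta}^* = \hat{\delta} + \hat{\rho}\alpha_0$ and the constant satisfies $\hat{\alpha}^* = \hat{\alpha} - \hat{\rho}\alpha_0$.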


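The asymptotic distribution of $\hat{\sigma}_{\hat{\rho}_T}$, the OLS standard error for $\hat{\rho}_T$, can be
found using similar calculations to those in [17.4.31] and [17.4.32]. Notice that

\[
\begin{aligned}
T^2 \cdot \hat{\sigma}^2_{\hat{\rho}_T}
&= T^2 \cdot s_T^2 \begin{bmatrix} 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix}
T & \sum \xi_{t-1} & \sum t \\
\sum \xi_{t-1} & \sum \xi_{t-1}^2 & \sum \xi_{t-1} t \\
\sum t & \sum t \xi_{t-1} & \sum t^2
\end{bmatrix}^{-1}
\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} \\
&= s_T^2 \begin{bmatrix} 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} T^{1/2} & 0 & 0 \\ 0 & T & 0 \\ 0 & 0 & T^{3/2} \end{bmatrix}
\begin{bmatrix}
T & \sum \xi_{t-1} & \sum t \\
\sum \xi_{t-1} & \sum \xi_{t-1}^2 & \sum \xi_{t-1} t \\
\sum t & \sum t \xi_{t-1} & \sum t^2
\end{bmatrix}^{-1}
\begin{bmatrix} T^{1/2} & 0 & 0 \\ 0 & T & 0 \\ 0 & 0 & T^{3/2} \end{bmatrix}
\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} \\
&= s_T^2 \begin{bmatrix} 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix}
1 & T^{-3/2}\sum \xi_{t-1} & T^{-2}\sum t \\
T^{-3/2}\sum \xi_{t-1} & T^{-2}\sum \xi_{t-1}^2 & T^{-5/2}\sum \xi_{t-1} t \\
T^{-2}\sum t & T^{-5/2}\sum t \xi_{t-1} & T^{-3}\sum t^2
\end{bmatrix}^{-1}
\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} \\
&\xrightarrow{L} \sigma^2 \begin{bmatrix} 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 \\ 0 & \sigma & 0 \\ 0 & 0 & 1 \end{bmatrix}^{-1}
\begin{bmatrix}
1 & \int_0^1 W(r)\,dr & \tfrac{1}{2} \\
\int_0^1 W(r)\,dr & \int_0^1 [W(r)]^2\,dr & \int_0^1 r W(r)\,dr \\
\tfrac{1}{2} & \int_0^1 r W(r)\,dr & \tfrac{1}{3}
\end{bmatrix}^{-1}
\begin{bmatrix} 1 & 0 & 0 \\ 0 & \sigma & 0 \\ 0 & 0 & 1 \end{bmatrix}^{-1}
\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} \\
&= \begin{bmatrix} 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix}
1 & \int_0^1 W(r)\,dr & \tfrac{1}{2} \\
\int_0^1 W(r)\,dr & \int_0^1 [W(r)]^2\,dr & \int_0^1 r W(r)\,dr \\
\tfrac{1}{2} & \int_0^1 r W(r)\,dr & \tfrac{1}{3}
\end{bmatrix}^{-1}
\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} \\
&\equiv Q.
\end{aligned} \tag*{[17.4.54]}
\]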


From this result it follows that the asymptotic distribution of the OLS $t$ test
of the hypothesis that $\rho = 1$ is given by

\[
t_T = T(\hat{\rho}_T - 1) \div (T^2 \cdot \hat{\sigma}^2_{\hat{\rho}_T})^{1/2}
\xrightarrow{p} T(\hat{\rho}_T - 1) \div \sqrt{Q}. \tag*{[17.4.55]}
\]

Again, this distribution does not depend on $\alpha$ or $\sigma$. The small-sample distribution
of the OLS $t$ statistic under the assumption of Gaussian disturbances is presented
under case 4 in Table B.6. If this distribution were truly $t$, then a value below $-2.0$
would be sufficient to reject the null hypothesis. However, Table B.6 reveals that,
because of the nonstandard distribution, the $t$ statistic must be below $-3.4$ before
the null hypothesis of a unit root could be rejected.




The assumption that the true value of $\delta$ is equal to zero is again an auxiliary
hypothesis upon which the asymptotic properties of the test depend. Thus, as in
case 2, it is natural to consider the OLS $F$ test of the joint null hypothesis that
$\delta = 0$ and $\rho = 1$. Though this $F$ test is calculated in the usual way, its asymptotic
distribution is nonstandard, and the calculated $F$ statistic should be compared with
the value under case 4 in Table B.7.


Summary of Dickey-Fuller Tests in the Absence 
of Serial Correlation 


We have seen that the asymptotic properties of the OLS estimate $\hat{\rho}_T$ when
the true value of $\rho$ is unity depend on whether or not a constant term or a time
trend is included in the regression that is estimated and on whether or not the
random walk that describes the true process for $y_t$ includes a drift term. These
results are summarized in Table 17.1.

Which is the ‘correct’ case to use to test the null hypothesis of a unit root? 
The answer depends on why we are interested in testing for a unit root. If the 
analyst has a specific null hypothesis about the process that generated the data, 
then obviously this would guide the choice of test. In the absence of such guidance, 
one general principle would be to fit a specification that is a plausible description 
of the data under both the null hypothesis and the alternative. This principle would 
suggest using the case 4 test for a series with an obvious trend and the case 2 test 
for series without a significant trend. 

For example, Figure 17.2 plots the nominal interest rate series used in the 
examples in this section. Although this series has tended upward over this sample 
period, there is nothing in economic theory to suggest that nominal interest rates 
should exhibit a deterministic time trend, and so a natural null hypothesis is that 
the true process is a random walk without trend. In terms of framing a plausible 
alternative, it is difficult to maintain that these data could have been generated by 
$i_t = \rho i_{t-1} + u_t$ with $|\rho|$ significantly less than 1. If these data were to be described
by a stationary process, surely the process would have a positive mean. This argues 
for including a constant term in the estimated regression, even though under the 
null hypothesis the true process does not contain a constant term. Thus, case 2 is 
a sensible approach for these data, as analyzed in Examples 17.4 and 17.5. 

As a second example, Figure 17.3 plots quarterly real GNP for the United
States from 1947:I to 1989:I. Given a growing population and technological
improvements, such a series would certainly be expected to exhibit a persistent upward
trend, and this trend is unmistakable in the figure. The question is whether this
trend arises from the positive drift term of a random walk:

\[
H_0\colon \; y_t = \alpha + y_{t-1} + u_t, \qquad \alpha > 0,
\]

or from a deterministic time trend added to a stationary AR(1):

\[
H_A\colon \; y_t = \alpha + \delta t + \rho y_{t-1} + u_t, \qquad |\rho| < 1.
\]

Thus, the recommended test statistics for this case are those described in case 4.
The following model for 100 times the log of real GNP (denoted $y_t$) was
estimated by OLS regression:

\[
y_t = \underset{(13.53)}{27.24} + \underset{(0.019304)}{0.96252}\, y_{t-1} + \underset{(0.01521)}{0.02753}\, t \tag*{[17.4.56]}
\]

(standard errors in parentheses). The sample size is $T = 168$. The Dickey-Fuller
$\rho$ test is

\[
T(\hat{\rho} - 1) = 168(0.96252 - 1.0) = -6.3.
\]

Since $-6.3 > -21.0$, the null hypothesis that GNP is characterized by a random
walk with possible drift is accepted at the 5% level. The Dickey-Fuller $t$ test,

\[
t = \frac{0.96252 - 1.0}{0.019304} = -1.94,
\]

exceeds the 5% critical value of $-3.44$, so that the null hypothesis of a unit root
is accepted by this test as well. Finally, the $F$ test of the joint null hypothesis that
$\delta = 0$ and $\rho = 1$ is 2.44. Since this is less than the 5% critical value of 6.42 from
Table B.7, this null hypothesis is again accepted.


TABLE 17.1
Summary of Dickey-Fuller Tests for Unit Roots in the Absence
of Serial Correlation


Case 1:

Estimated regression: $y_t = \rho y_{t-1} + u_t$

True process: $y_t = y_{t-1} + u_t$, $u_t \sim$ i.i.d. $N(0, \sigma^2)$

$T(\hat{\rho}_T - 1)$ has the distribution described under the heading Case 1 in Table B.5.

$(\hat{\rho}_T - 1)/\hat{\sigma}_{\hat{\rho}_T}$ has the distribution described under Case 1 in Table B.6.


Case 2:

Estimated regression: $y_t = \alpha + \rho y_{t-1} + u_t$

True process: $y_t = y_{t-1} + u_t$, $u_t \sim$ i.i.d. $N(0, \sigma^2)$

$T(\hat{\rho}_T - 1)$ has the distribution described under Case 2 in Table B.5.

$(\hat{\rho}_T - 1)/\hat{\sigma}_{\hat{\rho}_T}$ has the distribution described under Case 2 in Table B.6.

OLS $F$ test of joint hypothesis that $\alpha = 0$ and $\rho = 1$ has the distribution
described under Case 2 in Table B.7.


Case 3:

Estimated regression: $y_t = \alpha + \rho y_{t-1} + u_t$

True process: $y_t = \alpha + y_{t-1} + u_t$, $\alpha \neq 0$, $u_t \sim$ i.i.d. $(0, \sigma^2)$

$(\hat{\rho}_T - 1)/\hat{\sigma}_{\hat{\rho}_T} \xrightarrow{L} N(0, 1)$


Case 4:

Estimated regression: $y_t = \alpha + \rho y_{t-1} + \delta t + u_t$

True process: $y_t = \alpha + y_{t-1} + u_t$, $\alpha$ any, $u_t \sim$ i.i.d. $N(0, \sigma^2)$

$T(\hat{\rho}_T - 1)$ has the distribution described under Case 4 in Table B.5.

$(\hat{\rho}_T - 1)/\hat{\sigma}_{\hat{\rho}_T}$ has the distribution described under Case 4 in Table B.6.

OLS $F$ test of joint hypothesis that $\rho = 1$ and $\delta = 0$ has the distribution
described under Case 4 in Table B.7.


Notes to Table 17.1

Estimated regression indicates the form in which the regression is estimated, using observations
$t = 1, 2, \ldots, T$ and conditioning on observation $t = 0$.

True process describes the null hypothesis under which the distribution is calculated.

$\hat{\rho}_T$ is the OLS estimate of $\rho$ from the indicated regression based on a sample of size $T$.

$(\hat{\rho}_T - 1)/\hat{\sigma}_{\hat{\rho}_T}$ is the OLS $t$ test of $\rho = 1$.

OLS $F$ test of a hypothesis involving two restrictions is given by expression [17.7.39].

If $u_t \sim$ i.i.d. $N(0, \sigma^2)$, then Tables B.5 through B.7 give Monte Carlo estimates of the exact
small-sample distribution. The tables are also valid for large $T$ when $u_t$ is non-Gaussian i.i.d. as well
as for certain heterogeneously distributed serially uncorrelated processes. For serially correlated $u_t$, see
Table 17.2 or 17.3.




FIGURE 17.2 U.S. nominal interest rate on 3-month Treasury bills, data sampled
quarterly but quoted at an annual rate, 1947:I to 1989:I.


FIGURE 17.3 U.S. real GNP, data sampled quarterly but quoted at an annual
rate in billions of 1982 dollars, 1947:I to 1989:I.




Of the tests discussed so far, those developed for case 2 seem appropriate
for the interest rate data and the tests developed for case 4 seem best for the GNP
data. However, more general tests presented in Sections 17.6 and 17.7 are to be
preferred for describing either of these series. This is because the maintained
assumption throughout this section has been that the disturbance term $u_t$ in the
regression is i.i.d. There is no strong reason to expect this for either of these time
series. The next section develops results that can be used to test for unit roots in
serially correlated processes.


17.5. Asymptotic Results for Unit Root Processes 
with General Serial Correlation 


This section generalizes Proposition 17.1 to allow for serial correlation. The fol- 
lowing preliminary result is quite helpful. 


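Proposition 17.2: Let

\[
u_t = \psi(L)\varepsilon_t = \sum_{j=0}^{\infty} \psi_j \varepsilon_{t-j}, \tag*{[17.5.1]}
\]

where

\[
E(\varepsilon_t) = 0
\]

\[
E(\varepsilon_t \varepsilon_\tau) = \begin{cases} \sigma^2 & \text{for } t = \tau \\ 0 & \text{otherwise} \end{cases}
\]

\[
\sum_{j=0}^{\infty} j \cdot |\psi_j| < \infty. \tag*{[17.5.2]}
\]

Then

\[
u_1 + u_2 + \cdots + u_t = \psi(1) \cdot (\varepsilon_1 + \varepsilon_2 + \cdots + \varepsilon_t) + \eta_t - \eta_0, \tag*{[17.5.3]}
\]

where $\psi(1) \equiv \sum_{j=0}^{\infty} \psi_j$, $\eta_t \equiv \sum_{j=0}^{\infty} \alpha_j \varepsilon_{t-j}$, $\alpha_j \equiv -(\psi_{j+1} + \psi_{j+2} + \psi_{j+3} + \cdots)$, and
$\sum_{j=0}^{\infty} |\alpha_j| < \infty$.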

The condition in [17.5.2] is slightly stronger than absolute summability, though
it is satisfied by any stationary ARMA process.
Notice that if $y_t$ is an $I(1)$ process whose first difference is given by $u_t$, or

\[
\Delta y_t = u_t,
\]

then

\[
y_t = u_1 + u_2 + \cdots + u_t + y_0 = \psi(1) \cdot (\varepsilon_1 + \varepsilon_2 + \cdots + \varepsilon_t) + \eta_t - \eta_0 + y_0.
\]

Proposition 17.2 thus states that any $I(1)$ process whose first difference satisfies
[17.5.1] and [17.5.2] can be written as the sum of a random walk ($\psi(1) \cdot (\varepsilon_1 +
\varepsilon_2 + \cdots + \varepsilon_t)$), initial conditions ($y_0 - \eta_0$), and a stationary process ($\eta_t$). This
observation was first made by Beveridge and Nelson (1981), and [17.5.3] is sometimes
referred to as the Beveridge-Nelson decomposition.
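
For an MA(1) process the decomposition can be verified directly. With $u_t = \varepsilon_t + \theta\varepsilon_{t-1}$ we have $\psi(1) = 1 + \theta$, $\alpha_0 = -\theta$, and $\alpha_j = 0$ for $j \geq 1$, so that $\eta_t = -\theta\varepsilon_t$. The following Python sketch checks identity [17.5.3] on simulated data; it is an illustration only, and the function name `bn_check_ma1`, the value of $\theta$, and the seed are assumptions.

```python
import random


def bn_check_ma1(theta=0.6, T=50, seed=0):
    """Largest absolute gap between the two sides of [17.5.3] for an MA(1)."""
    rng = random.Random(seed)
    eps = [rng.gauss(0.0, 1.0) for _ in range(T + 1)]    # eps[0] is epsilon_0
    psi_one = 1.0 + theta                                # psi(1) = psi_0 + psi_1
    eta = [-theta * e for e in eps]                      # eta_t = alpha_0 * eps_t
    cum_u = 0.0
    cum_eps = 0.0
    max_gap = 0.0
    for t in range(1, T + 1):
        cum_u += eps[t] + theta * eps[t - 1]             # u_1 + ... + u_t
        cum_eps += eps[t]                                # eps_1 + ... + eps_t
        rhs = psi_one * cum_eps + eta[t] - eta[0]
        max_gap = max(max_gap, abs(cum_u - rhs))
    return max_gap


print(bn_check_ma1())      # zero up to floating-point rounding
```

The two sides agree for every $t$, not just in the limit: [17.5.3] is an exact identity, and the asymptotic arguments that follow rely only on the fact that the stationary part $\eta_t$ becomes negligible after division by $\sqrt{t}$.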

Notice that $\eta_t$ is a stationary process. An important implication of this is that
if [17.5.3] is divided by $\sqrt{t}$, only the first term $(1/\sqrt{t})\,\psi(1) \cdot (\varepsilon_1 + \varepsilon_2 + \cdots + \varepsilon_t)$
should matter for the distribution of $(1/\sqrt{t}) \cdot (u_1 + u_2 + \cdots + u_t)$ as $t \to \infty$.
As an example of how this result can be used, suppose that $X_T(r)$ is defined
as in [17.3.2]:

\[
X_T(r) \equiv (1/T) \sum_{t=1}^{[Tr]^*} u_t, \tag*{[17.5.4]}
\]

where $u_t$ satisfies the conditions of Proposition 17.2 with $\varepsilon_t$ i.i.d. and $E(\varepsilon_t^4) < \infty$.
Then the continuous-time process $\sqrt{T} \cdot X_T(r)$ converges to $\sigma \cdot \psi(1)$ times standard
Brownian motion:

\[
\sqrt{T} \cdot X_T(\cdot) \xrightarrow{L} \sigma \cdot \psi(1) \cdot W(\cdot). \tag*{[17.5.5]}
\]


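To derive [17.5.5], note from Proposition 17.2 that

\[
\begin{aligned}
\sqrt{T} \cdot X_T(r) &= (1/\sqrt{T}) \cdot \sum_{t=1}^{[Tr]^*} u_t \\
&= \psi(1) \cdot (1/\sqrt{T}) \cdot \sum_{t=1}^{[Tr]^*} \varepsilon_t + (1/\sqrt{T}) \cdot (\eta_{[Tr]^*} - \eta_0) \\
&= \psi(1) \cdot (1/\sqrt{T}) \cdot \sum_{t=1}^{[Tr]^*} \varepsilon_t + S_T(r),
\end{aligned} \tag*{[17.5.6]}
\]

where we have defined $S_T(r) \equiv (1/\sqrt{T}) \cdot (\eta_{[Tr]^*} - \eta_0)$. Notice as in Example 17.2
that

\[
S_T(\cdot) \xrightarrow{p} 0 \tag*{[17.5.7]}
\]

as $T \to \infty$. Furthermore, from [17.3.8],

\[
(1/\sqrt{T}) \cdot \sum_{t=1}^{[Tr]^*} \varepsilon_t \xrightarrow{L} \sigma \cdot W(r). \tag*{[17.5.8]}
\]

Substituting [17.5.7] and [17.5.8] into [17.5.6] produces [17.5.5].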
Another implication is found by evaluating the functions in [17.5.5] at
$r = 1$:

\[
(1/\sqrt{T}) \sum_{t=1}^{T} u_t \xrightarrow{L} \sigma \cdot \psi(1) \cdot W(1). \tag*{[17.5.9]}
\]

Since $W(1)$ is distributed $N(0, 1)$, result [17.5.9] states that

\[
(1/\sqrt{T}) \sum_{t=1}^{T} u_t \xrightarrow{L} N(0, \sigma^2 [\psi(1)]^2),
\]

which is the usual central limit theorem of Proposition 7.11.
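The following proposition uses this basic idea to generalize the other results
from Proposition 17.1; for details on the proofs, see Appendix 17.A.


Proposition 17.3: Let $u_t = \psi(L)\varepsilon_t = \sum_{j=0}^{\infty} \psi_j \varepsilon_{t-j}$, where $\sum_{j=0}^{\infty} j \cdot |\psi_j| < \infty$ and $\{\varepsilon_t\}$
is an i.i.d. sequence with mean zero, variance $\sigma^2$, and finite fourth moment. Define

\[
\gamma_j \equiv E(u_t u_{t-j}) = \sigma^2 \sum_{s=0}^{\infty} \psi_s \psi_{s+j} \qquad \text{for } j = 0, 1, 2, \ldots \tag*{[17.5.10]}
\]

\[
\lambda \equiv \sigma \sum_{j=0}^{\infty} \psi_j = \sigma \cdot \psi(1)
\]

\[
\xi_t \equiv u_1 + u_2 + \cdots + u_t \qquad \text{for } t = 1, 2, \ldots, T \tag*{[17.5.11]}
\]

with $\xi_0 \equiv 0$. Then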


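(a) $T^{-1/2} \sum_{t=1}^{T} u_t \xrightarrow{L} \lambda \cdot W(1)$;

(b) $T^{-1/2} \sum_{t=1}^{T} u_{t-j}\varepsilon_t \xrightarrow{L} N(0, \sigma^2\gamma_0)$ for $j = 1, 2, \ldots$;

(c) $T^{-1} \sum_{t=1}^{T} u_t u_{t-j} \xrightarrow{p} \gamma_j$ for $j = 0, 1, 2, \ldots$;

(d) $T^{-1} \sum_{t=1}^{T} \xi_{t-1}\varepsilon_t \xrightarrow{L} (1/2)\,\sigma \cdot \lambda \cdot \{[W(1)]^2 - 1\}$;

(e) $T^{-1} \sum_{t=1}^{T} \xi_{t-1} u_{t-j} \xrightarrow{L}
\begin{cases}
(1/2)\{\lambda^2 \cdot [W(1)]^2 - \gamma_0\} & \text{for } j = 0 \\
(1/2)\{\lambda^2 \cdot [W(1)]^2 - \gamma_0\} + \gamma_0 + \gamma_1 + \gamma_2 + \cdots + \gamma_{j-1} & \text{for } j = 1, 2, \ldots;
\end{cases}$

(f) $T^{-3/2} \sum_{t=1}^{T} \xi_{t-1} \xrightarrow{L} \lambda \cdot \int_0^1 W(r)\,dr$;

(g) $T^{-3/2} \sum_{t=1}^{T} t\,u_{t-j} \xrightarrow{L} \lambda \cdot \left\{ W(1) - \int_0^1 W(r)\,dr \right\}$ for $j = 0, 1, 2, \ldots$;

(h) $T^{-2} \sum_{t=1}^{T} \xi_{t-1}^2 \xrightarrow{L} \lambda^2 \cdot \int_0^1 [W(r)]^2\,dr$;

(i) $T^{-5/2} \sum_{t=1}^{T} t\,\xi_{t-1} \xrightarrow{L} \lambda \cdot \int_0^1 r\,W(r)\,dr$;

(j) $T^{-3} \sum_{t=1}^{T} t\,\xi_{t-1}^2 \xrightarrow{L} \lambda^2 \cdot \int_0^1 r \cdot [W(r)]^2\,dr$;

(k) $T^{-(v+1)} \sum_{t=1}^{T} t^v \to 1/(v + 1)$ for $v = 0, 1, \ldots$.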


Again, there are simpler ways to describe individual results; for example, (a)
is a $N(0, \lambda^2)$ distribution, (d) is $(1/2)\sigma\lambda \cdot [\chi^2(1) - 1]$, and (f) and (g) are both
$N(0, \lambda^2/3)$ distributions.

These results can be used to construct unit root tests for serially correlated
observations in two ways. One approach, due to Phillips (1987) and Phillips and
Perron (1988), is to continue to estimate the regressions in exactly the form indicated
in Table 17.1 but to adjust the test statistics to take account of serial correlation
and potential heteroskedasticity in the disturbances. This approach is described
in Section 17.6. The second approach, due to Dickey and Fuller (1979), is
to add lagged changes of $y$ as explanatory variables in the regressions in Table
17.1. This is described in Section 17.7.


17.6. Phillips-Perron Tests for Unit Roots 


Asymptotic Distribution for Case 2 Assumptions with Serially 
Correlated Disturbances 


To illustrate the basic idea behind the Phillips (1987) and Phillips and Perron
(1988) tests for unit roots, we will discuss in detail the treatment they propose for
the analog of case 2 of Section 17.4. After this case has been reviewed, similar
results will be stated for case 1 and case 4, with details developed in exercises at
the end of the chapter.


Case 2 of Section 17.4 considered OLS estimation of $\alpha$ and $\rho$ in the regression
model

\[
y_t = \alpha + \rho y_{t-1} + u_t \tag*{[17.6.1]}
\]

under the assumption that the true $\alpha = 0$, $\rho = 1$, and $u_t$ is i.i.d. Phillips and Perron
(1988) generalized these results to the case when $u_t$ is serially correlated and possibly
heteroskedastic as well. For now we will assume that the true process is

\[
y_t - y_{t-1} = u_t = \psi(L)\varepsilon_t,
\]

where $\psi(L)$ and $\varepsilon_t$ satisfy the conditions of Proposition 17.3. More general
conditions under which the same techniques are valid will be discussed at the end of
this section.

If [17.6.1] were a stationary autoregression with $|\rho| < 1$, the OLS estimate
$\hat{\rho}_T$ in [17.4.15] would not give a consistent estimate of $\rho$ when $u_t$ is serially correlated.
However, if $\rho$ is equal to 1, the rate $T$ convergence of $\hat{\rho}_T$ turns out to ensure that
$\hat{\rho}_T \xrightarrow{p} 1$ even when $u_t$ is serially correlated. Phillips and Perron (1988) therefore
proposed estimating [17.6.1] by OLS even when $u_t$ is serially correlated and then
modifying the statistics in Section 17.4 to take account of the serial correlation.

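Let $\hat{\alpha}_T$ and $\hat{\rho}_T$ be the OLS estimates based on [17.6.1] without any correction
for serial correlation; that is, $\hat{\alpha}_T$ and $\hat{\rho}_T$ are the magnitudes defined in [17.4.15].
If the true values are $\alpha = 0$ and $\rho = 1$, then, as in [17.4.22],

\[
\begin{bmatrix} T^{1/2}\hat{\alpha}_T \\ T(\hat{\rho}_T - 1) \end{bmatrix}
= \begin{bmatrix} 1 & T^{-3/2}\sum y_{t-1} \\ T^{-3/2}\sum y_{t-1} & T^{-2}\sum y_{t-1}^2 \end{bmatrix}^{-1}
\begin{bmatrix} T^{-1/2}\sum u_t \\ T^{-1}\sum y_{t-1}u_t \end{bmatrix}, \tag*{[17.6.2]}
\]

where $\sum$ denotes summation over $t$ from 1 to $T$. Also, under the null hypothesis
that $\alpha = 0$ and $\rho = 1$, it follows as in [17.4.4] that

\[
y_t = y_0 + u_1 + u_2 + \cdots + u_t.
\]

If $u_t = \psi(L)\varepsilon_t$ as in Proposition 17.3, then $y_t$ is the variable labeled $\xi_t$ in Proposition
17.3, plus the inconsequential value $y_0$. Using results (f) and (h) of that proposition,

\[
\begin{aligned}
\begin{bmatrix} 1 & T^{-3/2}\sum y_{t-1} \\ T^{-3/2}\sum y_{t-1} & T^{-2}\sum y_{t-1}^2 \end{bmatrix}^{-1}
&\xrightarrow{L}
\begin{bmatrix} 1 & \lambda \int W(r)\,dr \\ \lambda \int W(r)\,dr & \lambda^2 \int [W(r)]^2\,dr \end{bmatrix}^{-1} \\
&= \begin{bmatrix} 1 & 0 \\ 0 & \lambda \end{bmatrix}^{-1}
\begin{bmatrix} 1 & \int W(r)\,dr \\ \int W(r)\,dr & \int [W(r)]^2\,dr \end{bmatrix}^{-1}
\begin{bmatrix} 1 & 0 \\ 0 & \lambda \end{bmatrix}^{-1},
\end{aligned} \tag*{[17.6.3]}
\]

where the integral sign denotes integration over $r$ from 0 to 1. Similarly, results
(a) and (e) of Proposition 17.3 give

\[
\begin{aligned}
\begin{bmatrix} T^{-1/2}\sum u_t \\ T^{-1}\sum y_{t-1}u_t \end{bmatrix}
&\xrightarrow{L}
\begin{bmatrix} \lambda \cdot W(1) \\ \tfrac{1}{2}\{\lambda^2 [W(1)]^2 - \gamma_0\} \end{bmatrix} \\
&= \lambda \begin{bmatrix} 1 & 0 \\ 0 & \lambda \end{bmatrix}
\begin{bmatrix} W(1) \\ \tfrac{1}{2}\{[W(1)]^2 - 1\} \end{bmatrix}
+ \begin{bmatrix} 0 \\ \tfrac{1}{2}\{\lambda^2 - \gamma_0\} \end{bmatrix}.
\end{aligned} \tag*{[17.6.4]}
\]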


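Substituting [17.6.3] and [17.6.4] into [17.6.2] produces

\[
\begin{aligned}
\begin{bmatrix} T^{1/2}\hat{\alpha}_T \\ T(\hat{\rho}_T - 1) \end{bmatrix}
&\xrightarrow{L}
\begin{bmatrix} 1 & 0 \\ 0 & \lambda \end{bmatrix}^{-1}
\begin{bmatrix} 1 & \int W(r)\,dr \\ \int W(r)\,dr & \int [W(r)]^2\,dr \end{bmatrix}^{-1}
\begin{bmatrix} 1 & 0 \\ 0 & \lambda \end{bmatrix}^{-1} \\
&\qquad \times \left\{ \lambda \begin{bmatrix} 1 & 0 \\ 0 & \lambda \end{bmatrix}
\begin{bmatrix} W(1) \\ \tfrac{1}{2}\{[W(1)]^2 - 1\} \end{bmatrix}
+ \begin{bmatrix} 0 \\ \tfrac{1}{2}\{\lambda^2 - \gamma_0\} \end{bmatrix} \right\} \\
&= \begin{bmatrix} \lambda & 0 \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & \int W(r)\,dr \\ \int W(r)\,dr & \int [W(r)]^2\,dr \end{bmatrix}^{-1}
\begin{bmatrix} W(1) \\ \tfrac{1}{2}\{[W(1)]^2 - 1\} \end{bmatrix} \\
&\qquad + \begin{bmatrix} 1 & 0 \\ 0 & \lambda^{-1} \end{bmatrix}
\begin{bmatrix} 1 & \int W(r)\,dr \\ \int W(r)\,dr & \int [W(r)]^2\,dr \end{bmatrix}^{-1}
\begin{bmatrix} 0 \\ \tfrac{1}{2}\{\lambda^2 - \gamma_0\}/\lambda \end{bmatrix}.
\end{aligned} \tag*{[17.6.5]}
\]

The second element of this vector states that

\[
\begin{aligned}
T(\hat{\rho}_T - 1)
&\xrightarrow{L}
\begin{bmatrix} 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & \int W(r)\,dr \\ \int W(r)\,dr & \int [W(r)]^2\,dr \end{bmatrix}^{-1}
\begin{bmatrix} W(1) \\ \tfrac{1}{2}\{[W(1)]^2 - 1\} \end{bmatrix} \\
&\qquad + \frac{1}{2\lambda^2}\{\lambda^2 - \gamma_0\}
\begin{bmatrix} 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & \int W(r)\,dr \\ \int W(r)\,dr & \int [W(r)]^2\,dr \end{bmatrix}^{-1}
\begin{bmatrix} 0 \\ 1 \end{bmatrix} \\
&= \frac{\tfrac{1}{2}\{[W(1)]^2 - 1\} - W(1)\int W(r)\,dr}
{\int [W(r)]^2\,dr - \left[\int W(r)\,dr\right]^2}
+ \frac{(1/2) \cdot (\lambda^2 - \gamma_0)}
{\lambda^2 \left\{ \int [W(r)]^2\,dr - \left[\int W(r)\,dr\right]^2 \right\}}.
\end{aligned} \tag*{[17.6.6]}
\]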

The first term of the last equality in [17.6.6] is the same as [17.4.28], which
described the asymptotic distribution that $T(\hat{\rho}_T - 1)$ would have if $u_t$ were i.i.d.
The final term in [17.6.6] is a correction for serial correlation. Notice that if $u_t$ is
serially uncorrelated, then $\psi_0 = 1$ and $\psi_j = 0$ for $j = 1, 2, \ldots$. Thus, if $u_t$ is
serially uncorrelated, then $\lambda^2 = \sigma^2 \cdot [\psi(1)]^2 = \sigma^2$ and $\gamma_0 = E(u_t^2) = \sigma^2$. Hence
[17.6.6] includes the earlier result [17.4.28] as a special case when $u_t$ is serially
uncorrelated.

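It is easy to use $\hat{\sigma}_{\hat{\rho}_T}$, the OLS standard error for $\hat{\rho}_T$, to construct a sample statistic
that can be used to estimate a correction for serial correlation. Let $\Upsilon_T$ be the matrix
defined in [17.4.21], and let $s_T^2$ be the OLS estimate of the variance of $u_t$:

\[
s_T^2 = (T - 2)^{-1} \sum_{t=1}^{T} (y_t - \hat{\alpha}_T - \hat{\rho}_T y_{t-1})^2.
\]

Then the asymptotic distribution of $T^2 \cdot \hat{\sigma}^2_{\hat{\rho}_T}$ can be found using the same approach
as in [17.4.31] through [17.4.33]:

\[
\begin{aligned}
T^2 \cdot \hat{\sigma}^2_{\hat{\rho}_T}
&= s_T^2 \begin{bmatrix} 0 & 1 \end{bmatrix} \Upsilon_T
\begin{bmatrix} T & \sum y_{t-1} \\ \sum y_{t-1} & \sum y_{t-1}^2 \end{bmatrix}^{-1}
\Upsilon_T \begin{bmatrix} 0 \\ 1 \end{bmatrix} \\
&\xrightarrow{L} s_T^2 \begin{bmatrix} 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ 0 & \lambda \end{bmatrix}^{-1}
\begin{bmatrix} 1 & \int W(r)\,dr \\ \int W(r)\,dr & \int [W(r)]^2\,dr \end{bmatrix}^{-1}
\begin{bmatrix} 1 & 0 \\ 0 & \lambda \end{bmatrix}^{-1}
\begin{bmatrix} 0 \\ 1 \end{bmatrix} \\
&= (s_T^2/\lambda^2) \begin{bmatrix} 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & \int W(r)\,dr \\ \int W(r)\,dr & \int [W(r)]^2\,dr \end{bmatrix}^{-1}
\begin{bmatrix} 0 \\ 1 \end{bmatrix} \\
&= (s_T^2/\lambda^2) \cdot \frac{1}{\int [W(r)]^2\,dr - \left[\int W(r)\,dr\right]^2}.
\end{aligned} \tag*{[17.6.7]}
\]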
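It follows from [17.6.6] that

\[
\begin{aligned}
T(\hat{\rho}_T - 1) - \tfrac{1}{2}(T^2 \cdot \hat{\sigma}^2_{\hat{\rho}_T} \div s_T^2)(\lambda^2 - \gamma_0)
&\xrightarrow{p} T(\hat{\rho}_T - 1) - \frac{1}{2} \cdot \frac{1}{\lambda^2} \cdot
\frac{(\lambda^2 - \gamma_0)}{\int [W(r)]^2\,dr - \left[\int W(r)\,dr\right]^2} \\
&\xrightarrow{L} \frac{\tfrac{1}{2}\{[W(1)]^2 - 1\} - W(1)\int W(r)\,dr}
{\int [W(r)]^2\,dr - \left[\int W(r)\,dr\right]^2}.
\end{aligned} \tag*{[17.6.8]}
\]

Thus, the statistic in [17.6.8] has the same asymptotic distribution [17.4.28] as the
variable tabulated under the heading Case 2 in Table B.5.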
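Result [17.6.8] can also be used to find the asymptotic distribution of the
OLS $t$ test of $\rho = 1$:

\[
\begin{aligned}
t_T &= \frac{(\hat{\rho}_T - 1)}{\hat{\sigma}_{\hat{\rho}_T}}
= \frac{T(\hat{\rho}_T - 1)}{\{T^2 \cdot \hat{\sigma}^2_{\hat{\rho}_T}\}^{1/2}} \\
&\xrightarrow{p} \left\{ \frac{\tfrac{1}{2}\{[W(1)]^2 - 1\} - W(1)\int W(r)\,dr}
{\int [W(r)]^2\,dr - \left[\int W(r)\,dr\right]^2}
+ \tfrac{1}{2}(T^2 \cdot \hat{\sigma}^2_{\hat{\rho}_T} / s_T^2)(\lambda^2 - \gamma_0) \right\}
\div \{T^2 \cdot \hat{\sigma}^2_{\hat{\rho}_T}\}^{1/2} \\
&= \frac{\tfrac{1}{2}\{[W(1)]^2 - 1\} - W(1)\int W(r)\,dr}
{\int [W(r)]^2\,dr - \left[\int W(r)\,dr\right]^2}
\div \{T^2 \cdot \hat{\sigma}^2_{\hat{\rho}_T}\}^{1/2}
+ \{(1/2)(1/s_T)(\lambda^2 - \gamma_0)\} \times \{T^2 \cdot \hat{\sigma}^2_{\hat{\rho}_T} \div s_T^2\}^{1/2} \\
&\xrightarrow{p} \frac{\tfrac{1}{2}\{[W(1)]^2 - 1\} - W(1)\int W(r)\,dr}
{\int [W(r)]^2\,dr - \left[\int W(r)\,dr\right]^2}
\times \left( \frac{\lambda^2}{s_T^2} \right)^{1/2}
\times \left\{ \int [W(r)]^2\,dr - \left[\int W(r)\,dr\right]^2 \right\}^{1/2} \\
&\qquad + \{(1/2)(1/s_T)(\lambda^2 - \gamma_0)\} \times \{T^2 \cdot \hat{\sigma}^2_{\hat{\rho}_T} \div s_T^2\}^{1/2},
\end{aligned} \tag*{[17.6.9]}
\]

with the last convergence following from [17.6.7]. Moreover,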
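\[
s_T^2 = (T - 2)^{-1} \sum_{t=1}^{T} (y_t - \hat{\alpha}_T - \hat{\rho}_T y_{t-1})^2 \xrightarrow{p} E(u_t^2) = \gamma_0. \tag*{[17.6.10]}
\]

Hence, [17.6.9] implies that

\[
(\gamma_0/\lambda^2)^{1/2} \cdot t_T \xrightarrow{p}
\frac{\tfrac{1}{2}\{[W(1)]^2 - 1\} - W(1)\int W(r)\,dr}
{\left\{ \int [W(r)]^2\,dr - \left[\int W(r)\,dr\right]^2 \right\}^{1/2}}
+ \{\tfrac{1}{2}(\lambda^2 - \gamma_0)/\lambda\} \times \{T \cdot \hat{\sigma}_{\hat{\rho}_T} \div s_T\}. \tag*{[17.6.11]}
\]

Thus,

\[
(\gamma_0/\lambda^2)^{1/2} \cdot t_T - \{\tfrac{1}{2}(\lambda^2 - \gamma_0)/\lambda\} \times \{T \cdot \hat{\sigma}_{\hat{\rho}_T} \div s_T\}
\xrightarrow{L}
\frac{\tfrac{1}{2}\{[W(1)]^2 - 1\} - W(1)\int W(r)\,dr}
{\left\{ \int [W(r)]^2\,dr - \left[\int W(r)\,dr\right]^2 \right\}^{1/2}}, \tag*{[17.6.12]}
\]

which is the same limiting distribution [17.4.36] obtained for the random variable
tabulated for case 2 in Table B.6.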

The statistics in [17.6.8] and [17.6.12] require knowledge of the population
parameters $\gamma_0$ and $\lambda^2$. Although these moments are unknown, they are easy to
estimate consistently. Since $\gamma_0 = E(u_t^2)$, one consistent estimate is given by

\[
\hat{\gamma}_0 = T^{-1} \sum_{t=1}^{T} \hat{u}_t^2, \tag*{[17.6.13]}
\]

where $\hat{u}_t = y_t - \hat{\alpha}_T - \hat{\rho}_T y_{t-1}$ is the OLS sample residual. Phillips and Perron
(1988) instead used the standard OLS estimate $\hat{\gamma}_0 = (T - 2)^{-1} \sum \hat{u}_t^2 = s_T^2$. Similarly,
from result (a) of Proposition 17.3, $\lambda^2$ is the asymptotic variance of the sample
mean of $u$:

\[
\sqrt{T} \cdot \bar{u} = T^{-1/2} \sum_{t=1}^{T} u_t \xrightarrow{L} N(0, \lambda^2). \tag*{[17.6.14]}
\]


Recalling the discussion of the variance of the sample mean in Sections 7.2 and
10.5, this magnitude can equivalently be described as

\[
\lambda^2 = \sigma^2 \cdot [\psi(1)]^2 = \gamma_0 + 2 \sum_{j=1}^{\infty} \gamma_j = 2\pi s_u(0), \tag*{[17.6.15]}
\]

where $\gamma_j$ is the $j$th autocovariance of $u_t$ and $s_u(0)$ is the population spectrum of $u_t$
at frequency zero. Thus, any of the estimates of this magnitude proposed in Section
10.5 might be used. For example, if only the first $q$ autocovariances are deemed
relevant, the Newey-West estimator could be used:

\[
\hat{\lambda}^2 = \hat{\gamma}_0 + 2 \sum_{j=1}^{q} [1 - j/(q + 1)]\,\hat{\gamma}_j, \tag*{[17.6.16]}
\]

where

\[
\hat{\gamma}_j = T^{-1} \sum_{t=j+1}^{T} \hat{u}_t \hat{u}_{t-j} \tag*{[17.6.17]}
\]

and $\hat{u}_t = y_t - \hat{\alpha}_T - \hat{\rho}_T y_{t-1}$.
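
Formulas [17.6.16] and [17.6.17] translate directly into code. The Python sketch below is illustrative only; the function names `autocovariance`, `newey_west_from_autocov` and `newey_west` are hypothetical, and the residual series used in the usage line is a made-up toy example.

```python
def autocovariance(resid, j):
    """gamma_hat_j = T^{-1} * sum over t = j+1..T of u_t * u_{t-j}, as in [17.6.17]."""
    T = len(resid)
    return sum(resid[t] * resid[t - j] for t in range(j, T)) / T


def newey_west_from_autocov(gammas):
    """lambda_hat^2 from [gamma_0, ..., gamma_q] with weights 1 - j/(q+1), as in [17.6.16]."""
    q = len(gammas) - 1
    weighted = sum((1.0 - j / (q + 1)) * gammas[j] for j in range(1, q + 1))
    return gammas[0] + 2.0 * weighted


def newey_west(resid, q):
    """lambda_hat^2 computed from a list of OLS residuals and a truncation lag q."""
    return newey_west_from_autocov([autocovariance(resid, j) for j in range(q + 1)])


# Toy residual series (not from the text), truncation lag q = 1.
print(newey_west([1.0, -1.0, 1.0, -1.0], 1))
```

The declining weights $1 - j/(q + 1)$ guarantee that the estimate is nonnegative, which would not be assured if the first $q$ sample autocovariances were simply added with equal weight.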

To summarize, under the null hypothesis that the first difference of $y_t$ is a
zero-mean covariance-stationary process, the Phillips and Perron$^{2}$ approach is
to estimate equation [17.6.1] by OLS and use the standard OLS formulas to calculate
$\hat{\rho}$ and its standard error $\hat{\sigma}_{\hat{\rho}}$ along with the standard error of the regression
$s$. The $j$th autocovariance of $\hat{u}_t = y_t - \hat{\alpha} - \hat{\rho} y_{t-1}$ is then calculated from [17.6.17].
The resulting estimates $\hat{\gamma}_0$ and $\hat{\lambda}^2$ are then used in [17.6.8] to construct a statistic
that has the same asymptotic distribution as does the variable tabulated in the case
2 section of Table B.5. The analogous adjustments to the standard OLS $t$ test of
$\rho = 1$ described in [17.6.12] produce a statistic that can be compared with the case
2 section of Table B.6.


Example 17.6 


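Let $\hat{u}_t$ denote the OLS sample residual for the interest rate regression [17.4.37]
of Example 17.4:

\[
\hat{u}_t = i_t - \underset{(0.112)}{0.211} - \underset{(0.019133)}{0.96691}\, i_{t-1}
\qquad \text{for } t = 1, 2, \ldots, 168.
\]

The estimated autocovariances of these OLS residuals are

\[
\begin{aligned}
\hat{\gamma}_0 &= (1/T) \sum_{t=1}^{T} \hat{u}_t^2 = 0.630 &
\hat{\gamma}_1 &= (1/T) \sum_{t=2}^{T} \hat{u}_t \hat{u}_{t-1} = 0.114 \\
\hat{\gamma}_2 &= (1/T) \sum_{t=3}^{T} \hat{u}_t \hat{u}_{t-2} = -0.162 &
\hat{\gamma}_3 &= (1/T) \sum_{t=4}^{T} \hat{u}_t \hat{u}_{t-3} = 0.064 \\
\hat{\gamma}_4 &= (1/T) \sum_{t=5}^{T} \hat{u}_t \hat{u}_{t-4} = 0.047.
\end{aligned}
\]


$^{2}$The procedure recommended by Phillips and Perron differs slightly from that in the text. To see
the relation, write the first line of [17.6.7] as

\[
\begin{aligned}
T^2 \cdot \hat{\sigma}^2_{\hat{\rho}_T} \div s_T^2
&= \begin{bmatrix} 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & T^{-3/2}\sum y_{t-1} \\ T^{-3/2}\sum y_{t-1} & T^{-2}\sum y_{t-1}^2 \end{bmatrix}^{-1}
\begin{bmatrix} 0 \\ 1 \end{bmatrix} \\
&= \frac{1}{T^{-2}\sum y_{t-1}^2 - (T^{-3/2}\sum y_{t-1})^2} \\
&= \frac{1}{T^{-2}\{\sum y_{t-1}^2 - T(T^{-1}\sum y_{t-1})^2\}} \\
&= \frac{1}{T^{-2}\sum (y_{t-1} - \bar{y}_{-1})^2},
\end{aligned}
\]

where $\bar{y}_{-1} \equiv T^{-1}\sum y_{t-1}$ and the last equality follows from [4.A.5]. Instead of this expression, Phillips
and Perron used

\[
\frac{1}{T^{-2}\sum (y_t - \bar{y})^2}.
\]

The advantage of the formula in the text is that it is trivial to calculate from the output produced by
standard regression packages and the identical formula can be used for cases 1, 2, and 4.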




Thus, if the serial correlation of $u_t$ is to be described with $q = 4$ autocovariances,

\[
\hat{\lambda}^2 = 0.630 + 2(\tfrac{4}{5})(0.114) + 2(\tfrac{3}{5})(-0.162) + 2(\tfrac{2}{5})(0.064) + 2(\tfrac{1}{5})(0.047) = 0.688.
\]

The usual OLS formula for the variance of the residuals from this regression
is

\[
s^2 = (T - 2)^{-1} \sum_{t=1}^{T} \hat{u}_t^2 = 0.63760.
\]

Hence, the Phillips-Perron $\rho$ statistic is

\[
\begin{aligned}
T(\hat{\rho} - 1) - (1/2) \cdot (T^2 \cdot \hat{\sigma}^2_{\hat{\rho}} \div s^2)(\hat{\lambda}^2 - \hat{\gamma}_0)
&= 168(0.96691 - 1) - \tfrac{1}{2}\{[(168)(0.019133)]^2/(0.63760)\}(0.688 - 0.630) \\
&= -6.03.
\end{aligned}
\]

Comparing this with the 5% critical value for case 2 of Table B.5, we see that
$-6.03 > -13.8$. We thus accept the null hypothesis that the interest rate data
could plausibly have been generated by a simple unit root process.
Similarly, the adjustment to the $t$ statistic from Example 17.4 described
in [17.6.12] is

\[
\begin{aligned}
&(\hat{\gamma}_0/\hat{\lambda}^2)^{1/2}\, t - \{\tfrac{1}{2}(\hat{\lambda}^2 - \hat{\gamma}_0)(T \cdot \hat{\sigma}_{\hat{\rho}} \div s) \div \hat{\lambda}\} \\
&\quad = \{(0.630)/(0.688)\}^{1/2}(0.96691 - 1)/0.019133 \\
&\qquad - \{(1/2)(0.688 - 0.630)[(168)(0.019133)/\sqrt{0.63760}] \div \sqrt{0.688}\} \\
&\quad = -1.80.
\end{aligned}
\]

Since $-1.80 > -2.89$, the null hypothesis of a unit root is again accepted at
the 5% level.
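
The two adjusted statistics depend only on quantities reported by a standard regression package together with $\hat{\gamma}_0$ and $\hat{\lambda}^2$, so they are simple to compute. The following Python sketch reproduces the numbers in this example; it is an illustration only, and the function names `pp_rho_stat` and `pp_t_stat` are hypothetical.

```python
import math


def pp_rho_stat(T, rho_hat, se_rho, s2, gamma0, lam2):
    """Statistic in [17.6.8]: T(rho-1) - (1/2)(T^2 se^2 / s^2)(lambda^2 - gamma_0)."""
    return T * (rho_hat - 1.0) - 0.5 * ((T * se_rho) ** 2 / s2) * (lam2 - gamma0)


def pp_t_stat(T, rho_hat, se_rho, s2, gamma0, lam2):
    """Statistic in [17.6.12]: rescaled OLS t statistic minus a serial correlation term."""
    t_ols = (rho_hat - 1.0) / se_rho
    correction = 0.5 * (lam2 - gamma0) * (T * se_rho / math.sqrt(s2)) / math.sqrt(lam2)
    return math.sqrt(gamma0 / lam2) * t_ols - correction


# Inputs quoted in Examples 17.4 and 17.6 for the interest rate regression.
args = dict(T=168, rho_hat=0.96691, se_rho=0.019133,
            s2=0.63760, gamma0=0.630, lam2=0.688)
print(round(pp_rho_stat(**args), 2))
print(round(pp_t_stat(**args), 2))
```

When $\hat{\lambda}^2 = \hat{\gamma}_0$ both correction terms vanish and the statistics reduce to the Dickey-Fuller statistics of Example 17.4, which is the sense in which the earlier results are a special case.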


Phillips-Perron Tests for Cases 1 and 4 


The asymptotic distributions in [17.6.8] and [17.6.12] were derived under the
assumption that the true process for the first difference of $y_t$ is serially correlated
with mean zero. Even though the true unit root process exhibited no drift, it was
assumed that the estimated OLS regression included a constant term as in case 2
of Section 17.4.

The same ideas can be used to generalize case 1 or case 4 of Section 17.4, 
and the statistics [17.6.8] and [17.6.12] can be compared in each case with the 
corresponding values in Tables B.5 and B.6. These results are summarized in Table 
17.2. The reader is invited to confirm these claims in exercises at the end of the 
chapter. 


Example 17.7 


The residuals from the GNP regression [17.4.56] have the following estimated 
autocovariances: 


Jo = 1.136 4 = 0.424 }% = 0.285 
4; = 0.006 +, = —0.110, 


from which 


A2 = 1.136 + 2{8(0.424) + 3(0.285) + 3(0.006) — $(0.110)} = 2.117. 


512 Chapter 17 | Univariate Processes with Unit Roots 


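Also, $s^2 = 1.15627$. Thus, for these data the Phillips-Perron $\rho$ test is

\[
\begin{aligned}
T(\hat{\rho} - 1) - \tfrac{1}{2}\,(T^2 \hat{\sigma}_{\hat{\rho}}^2 / s^2)(\hat{\lambda}^2 - \hat{\gamma}_0)
&= 168(0.96252 - 1) - \tfrac{1}{2}\{[(168)(0.019304)]^2/(1.15627)\}(2.117 - 1.136) \\
&= -10.76.
\end{aligned}
\]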


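Since $-10.76 > -21.0$, the null hypothesis that log GNP follows a unit root
process with or without drift is accepted at the 5% level.
The Phillips-Perron $t$ test is

\[
\begin{aligned}
(\hat{\gamma}_0/\hat{\lambda}^2)^{1/2}\, t &- \{\tfrac{1}{2}(\hat{\lambda}^2 - \hat{\gamma}_0)(T \hat{\sigma}_{\hat{\rho}}/s) \div \hat{\lambda}\} \\
&= \{(1.136)/(2.117)\}^{1/2}(0.96252 - 1)/0.019304 \\
&\quad - \{\tfrac{1}{2}(2.117 - 1.136)[(168)(0.019304)/\sqrt{1.15627}] \div \sqrt{2.117}\} \\
&= -2.44.
\end{aligned}
\]

Since $-2.44 > -3.44$, the null hypothesis of a unit root is again accepted.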


More General Processes for $u_t$


The Newey-West estimator $\hat{\lambda}^2$ in [17.6.16] can provide a consistent estimate
of $\lambda^2$ for an $MA(\infty)$ process, provided that $q$, the lag truncation parameter, goes
to infinity as the sample size $T$ grows, and provided that $q$ grows sufficiently slowly
relative to $T$. Phillips (1987) established such consistency assuming that $q_T \to \infty$
and $q_T/T^{1/4} \to 0$; for example, $q_T = A \cdot T^{1/5}$ satisfies this requirement. Phillips's
results warrant using a larger value of $q$ with a larger data set, though they do not
tell us exactly how large to choose $q$ in practice. Monte Carlo investigations have
been provided by Phillips and Perron (1988), Schwert (1989), and Kim and Schmidt
(1990), though no simple rule emerges from these studies. Andrews's (1991) procedures
might be used in this context.
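As a rough illustration of the estimator and of a truncation rule of the form $q_T = A \cdot T^{1/5}$, the following sketch computes the Newey-West estimate directly from a residual series. The function names and the choice $A = 1$ are assumptions made for the example only; as the text notes, the asymptotic theory does not pin down a specific value of $q$.

```python
import numpy as np

def sample_autocov(u, j):
    """gamma_hat_j = T^{-1} * sum_{t=j+1}^{T} u_t * u_{t-j} (u is a residual series)."""
    u = np.asarray(u, dtype=float)
    T = u.shape[0]
    return float(u[j:] @ u[:T - j]) / T

def newey_west_lambda2(u, q):
    """gamma_hat_0 + 2 * sum_{j=1}^{q} [1 - j/(q+1)] * gamma_hat_j."""
    lam2 = sample_autocov(u, 0)
    for j in range(1, q + 1):
        lam2 += 2.0 * (1.0 - j / (q + 1.0)) * sample_autocov(u, j)
    return lam2

def truncation_lag(T, A=1.0):
    """Hypothetical rule q_T = floor(A * T**(1/5)); A is an arbitrary constant."""
    return int(np.floor(A * T ** 0.2))

# With this rule q_T grows with T while q_T / T**(1/4) shrinks.
for T in (10**2, 10**4, 10**8):
    q = truncation_lag(T)
    print(T, q, q / T ** 0.25)
```

The rule is one of many that satisfy the rate condition; any particular constant $A$ is a practical choice, not an implication of the consistency result.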

Asymptotic results can also be obtained under weaker assumptions about $u_t$
than those in Proposition 17.3. For example, the reader may note from the proof
of result 17.3(e) that the parameter $\gamma_0$ appears because it is the plim of
$T^{-1} \sum_{t=1}^{T} u_t^2$. Under the conditions of the proposition, the law of large numbers ensures
that this plim is just the expected value of $u_t^2$, which expected value was denoted
$\gamma_0$. However, even if the data are heterogeneously distributed with $E(u_t^2) = \gamma_{0,t}$,
it may still be the case that $T^{-1} \sum_{t=1}^{T} \gamma_{0,t}$ converges to some constant. If
$T^{-1} \sum_{t=1}^{T} u_t^2$ also converges to this constant, then this constant plays the role of $\gamma_0$
in a generalization of result 17.3(e).

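Similarly, let $\bar{u}_T$ denote the sample mean from some heterogeneously distributed
process with population mean zero:

\[
\bar{u}_T \equiv T^{-1} \sum_{t=1}^{T} u_t,
\]

and let $\lambda_T^2$ denote $T$ times the variance of $\bar{u}_T$:

\[
\lambda_T^2 \equiv T \cdot \operatorname{Var}(\bar{u}_T) = T^{-1} \cdot E(u_1 + u_2 + \cdots + u_T)^2.
\]

The sample mean $\bar{u}_T$ may still satisfy the central limit theorem:

\[
T^{-1/2} \sum_{t=1}^{T} u_t \xrightarrow{L} N(0, \lambda^2)
\]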




TABLE 17.2
Summary of Phillips-Perron Tests for Unit Roots


Case 1:

Estimated regression: $y_t = \rho y_{t-1} + u_t$

True process: $y_t = y_{t-1} + u_t$

$Z_\rho$ has the same asymptotic distribution as the variable described under the
heading Case 1 in Table B.5.

$Z_t$ has the same asymptotic distribution as the variable described under Case
1 in Table B.6.


Case 2:

Estimated regression: $y_t = \alpha + \rho y_{t-1} + u_t$

True process: $y_t = y_{t-1} + u_t$

$Z_\rho$ has the same asymptotic distribution as the variable described under Case
2 in Table B.5.

$Z_t$ has the same asymptotic distribution as the variable described under Case
2 in Table B.6.


Case 4:

Estimated regression: $y_t = \alpha + \rho y_{t-1} + \delta t + u_t$

True process: $y_t = \alpha + y_{t-1} + u_t$, $\alpha$ any value

$Z_\rho$ has the same asymptotic distribution as the variable described under Case
4 in Table B.5.

$Z_t$ has the same asymptotic distribution as the variable described under Case
4 in Table B.6.


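Notes to Table 17.2

Estimated regression indicates the form in which the regression is estimated, using observations
$t = 1, 2, \ldots, T$ and conditioning on observation $t = 0$.

True process describes the null hypothesis under which the distribution is calculated. In each
case, $u_t$ is assumed to have mean zero but can be heterogeneously distributed and serially correlated
with

\[
\lim_{T \to \infty} T^{-1} \sum_{t=1}^{T} E(u_t^2) = \gamma_0
\]

\[
\lim_{T \to \infty} T^{-1} E(u_1 + u_2 + \cdots + u_T)^2 = \lambda^2.
\]

$Z_\rho$ is the following statistic:

\[
Z_\rho \equiv T(\hat{\rho}_T - 1) - \tfrac{1}{2}\{T^2 \hat{\sigma}_{\hat{\rho}_T}^2 \div s_T^2\}(\hat{\lambda}_T^2 - \hat{\gamma}_{0,T})
\]

where

\[
\hat{\gamma}_{j,T} = T^{-1} \sum_{t=j+1}^{T} \hat{u}_t \hat{u}_{t-j}
\]

$\hat{u}_t$ = OLS sample residual from the estimated regression

\[
\hat{\lambda}_T^2 = \hat{\gamma}_{0,T} + 2 \sum_{j=1}^{q} [1 - j/(q+1)]\, \hat{\gamma}_{j,T}
\]

\[
s_T^2 = (T - k)^{-1} \sum_{t=1}^{T} \hat{u}_t^2
\]

$k$ = number of parameters in estimated regression

$\hat{\sigma}_{\hat{\rho}_T}$ = OLS standard error for $\hat{\rho}$.

$Z_t$ is the following statistic:

\[
Z_t \equiv (\hat{\gamma}_{0,T}/\hat{\lambda}_T^2)^{1/2} \cdot (\hat{\rho}_T - 1)/\hat{\sigma}_{\hat{\rho}_T}
- \tfrac{1}{2}(\hat{\lambda}_T^2 - \hat{\gamma}_{0,T})(1/\hat{\lambda}_T)\{T \cdot \hat{\sigma}_{\hat{\rho}_T} \div s_T\}.
\]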




or

\[
T^{-1/2} \sum_{t=1}^{T} u_t \xrightarrow{L} \lambda \cdot W(1),
\]

where

\[
\lambda^2 \equiv \lim_{T \to \infty} \lambda_T^2, \tag{17.6.18}
\]

providing a basis for generalizing result 17.3(a).

If $u_t$ were a covariance-stationary process with absolutely summable autocovariances,
then Proposition 7.5(b) would imply that $\lim_{T \to \infty} \lambda_T^2 = \sum_{j=-\infty}^{\infty} \gamma_j$.
Recalling [7.2.8], expression [17.6.18] would in this case just be another way to
describe the parameter $\lambda^2$ in Proposition 17.3.

Thus, the parameters $\gamma_0$ and $\lambda^2$ in [17.6.8] and [17.6.12] can more generally
be defined as

\[
\gamma_0 \equiv \lim_{T \to \infty} T^{-1} \sum_{t=1}^{T} E(u_t^2) \tag{17.6.19}
\]

\[
\lambda^2 \equiv \lim_{T \to \infty} T^{-1} \cdot E(u_1 + u_2 + \cdots + u_T)^2. \tag{17.6.20}
\]

Phillips (1987) and Perron and Phillips (1988) derived [17.6.8] and [17.6.12] assuming
that $u_t$ is a zero-mean but otherwise heterogeneously distributed process
satisfying certain restrictions on the serial dependence and higher moments. From
this perspective, expressions [17.6.19] and [17.6.20] can be used as the definitions
of the parameters $\gamma_0$ and $\lambda^2$. Clearly, the estimators [17.6.13] and [17.6.16] continue
to be appropriate for this alternative interpretation.


On the Observational Equivalence of Unit Root 
and Covariance-Stationary Processes 


We saw in Section 15.4 that given any $I(0)$ process for $y_t$ and any finite sample
size $T$, there exists an $I(1)$ process that will be impossible to distinguish from the
$I(0)$ representation on the basis of the first and second sample moments of $y$. Yet
the Phillips and Perron procedures seem to offer a way to test the null hypothesis
that the sample was generated from an arbitrary $I(1)$ process. What does it mean
if the test leads us to reject the null hypothesis that $y_t$ is $I(1)$ when we know that
there exists an $I(1)$ process that describes the sample arbitrarily well?

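Some insight into this question can be gained by considering the example in
equation [15.4.8],

\[
(1 - L) y_t = (1 + \theta L) \varepsilon_t, \tag{17.6.21}
\]

where $\theta$ is slightly larger than $-1$ and $\varepsilon_t$ is i.i.d. with mean zero and variance $\sigma^2$.
The model [17.6.21] implies that

\[
\begin{aligned}
y_t &= (\varepsilon_t + \theta \varepsilon_{t-1}) + (\varepsilon_{t-1} + \theta \varepsilon_{t-2}) + \cdots + (\varepsilon_1 + \theta \varepsilon_0) + y_0 \\
&= \varepsilon_t + (1 + \theta)\varepsilon_{t-1} + (1 + \theta)\varepsilon_{t-2} + \cdots + (1 + \theta)\varepsilon_1 + \theta \varepsilon_0 + y_0 \\
&= \varepsilon_t + (1 + \theta)\xi_{t-1} + \theta \varepsilon_0 + y_0,
\end{aligned}
\]

where

\[
\xi_{t-1} \equiv \varepsilon_1 + \varepsilon_2 + \cdots + \varepsilon_{t-1}.
\]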




For large $t$, the variable $y_t$ is dominated by the unit root component, $(1 + \theta)\xi_{t-1}$,
and the asymptotic results are all governed by this term. However, if $\theta$ is close to
$-1$, then in a finite sample $y_t$ would behave essentially like the white noise series
$\varepsilon_t$ plus a constant $(\theta \varepsilon_0 + y_0)$. In such a case the Phillips-Perron test is likely to
reject the null hypothesis of a unit root in finite samples even though it is true.¹⁰
For example, Schwert (1989) generated Monte Carlo samples of size $T = 1{,}000$
according to the unit root model [17.6.21] with $\theta = -0.8$. The Phillips-Perron test
that is supposed to reject only 5% of the time actually rejected the null hypothesis
in virtually every sample, even though the null hypothesis is true! Similar results
were reported by Phillips and Perron (1988) and Kim and Schmidt (1990).
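The decomposition of $y_t$ into a white noise term, a scaled random walk, and a constant can be checked numerically. The sketch below simulates one path from [17.6.21] with $\theta = -0.8$ and confirms that the two expressions for $y_t$ agree; the seed, sample length, and variable names are arbitrary choices for illustration, and this is not a replication of Schwert's Monte Carlo study.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, T, y0 = -0.8, 200, 0.0
eps = rng.standard_normal(T + 1)      # eps[0] is eps_0, eps[t] is eps_t

# Build y_t from (1 - L) y_t = (1 + theta L) eps_t by cumulating the differences.
dy = eps[1:] + theta * eps[:-1]       # dy[t-1] = eps_t + theta * eps_{t-1}, t = 1..T
y = y0 + np.cumsum(dy)                # y[t-1] = y_t

# Decomposition: y_t = eps_t + (1 + theta) * xi_{t-1} + theta * eps_0 + y_0,
# where xi_{t-1} = eps_1 + ... + eps_{t-1} and xi_0 = 0.
xi_lag = np.concatenate(([0.0], np.cumsum(eps[1:T])))   # xi_0, xi_1, ..., xi_{T-1}
y_decomp = eps[1:] + (1.0 + theta) * xi_lag + theta * eps[0] + y0

max_gap = float(np.max(np.abs(y - y_decomp)))
print(max_gap)
```

With $\theta = -0.8$ the random walk enters with weight $1 + \theta = 0.2$, which is why the white noise term can dominate the path in samples of moderate length.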
Campbell and Perron (1991) argued that such false rejections are not necessarily
a bad thing. If $\theta$ is near $-1$, then for many purposes an $I(0)$ model may
provide a more useful description of the process in [17.6.21] than does the true
$I(1)$ model. In support of this claim, they generated samples from the process
[17.6.21] and estimated by OLS both an autoregressive process in levels,

\[
y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t,
\]

and an autoregressive process in differences,

\[
\Delta y_t = \alpha + \zeta_1 \Delta y_{t-1} + \zeta_2 \Delta y_{t-2} + \cdots + \zeta_p \Delta y_{t-p} + \varepsilon_t.
\]

They found that for $\theta$ close to $-1$, forecasts based on the levels $y_t$ tended to
perform better than those based on the differences $\Delta y_t$, even though the true
data-generating process was $I(1)$.

A related issue, of course, arises with false acceptances. Clearly, if the true
model is

\[
y_t = \rho y_{t-1} + \varepsilon_t \tag{17.6.22}
\]

with $\rho$ slightly below 1, then the null hypothesis that $\rho = 1$ is likely to be accepted
in small samples, even though it is false. The value of accepting a false null hypothesis
in this case is that imposing the condition $\rho = 1$ may produce a better
forecast than one based on an estimated $\hat{\rho}_T$, particularly given the small-sample
downward bias of $\hat{\rho}_T$. Furthermore, when $\rho$ is close to 1, the values in Table B.6
might give a better small-sample approximation to the distribution of
$(\hat{\rho}_T - 1) \div \hat{\sigma}_{\hat{\rho}_T}$ than the traditional $t$ tables.¹¹

This discussion underscores that the goal of unit root tests is to find a parsimonious
representation that gives a reasonable approximation to the true process,
as opposed to determining whether or not the true process is literally $I(1)$.


17.7. Asymptotic Properties of a pth-Order 
Autoregression and the Augmented 
Dickey-Fuller Tests for Unit Roots 


The Phillips-Perron tests were based on simple OLS regressions of y, on its own 
lagged value and possibly a constant or time trend as well. Corrections for serial 
correlation were then made to the standard OLS coefficient and t statistics. This 
section discusses an alternative approach, due to Dickey and Fuller (1979), which 
controls for serial correlation by including higher-order autoregressive terms in the 
regression. 


¹⁰For more detailed discussion, see Phillips and Perron (1988, p. 344).
¹¹See Evans and Savin (1981, 1984) for a description of the small-sample distributions.




An Alternative Representation of an AR(p) Process 


Suppose that the data were really generated from an $AR(p)$ process,

\[
(1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p) y_t = \varepsilon_t, \tag{17.7.1}
\]

where $\{\varepsilon_t\}$ is an i.i.d. sequence with mean zero, variance $\sigma^2$, and finite fourth
moment. It is helpful to write the autoregression [17.7.1] in a slightly different
form. To do so, define

\[
\rho \equiv \phi_1 + \phi_2 + \cdots + \phi_p \tag{17.7.2}
\]

\[
\zeta_j \equiv -[\phi_{j+1} + \phi_{j+2} + \cdots + \phi_p] \quad \text{for } j = 1, 2, \ldots, p - 1. \tag{17.7.3}
\]
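Notice that for any values of $\phi_1, \phi_2, \ldots, \phi_p$, the following polynomials in $L$ are
equivalent:

\[
\begin{aligned}
(1 &- \rho L) - (\zeta_1 L + \zeta_2 L^2 + \cdots + \zeta_{p-1} L^{p-1})(1 - L) \\
&= 1 - \rho L - \zeta_1 L + \zeta_1 L^2 - \zeta_2 L^2 + \zeta_2 L^3 - \cdots - \zeta_{p-1} L^{p-1} + \zeta_{p-1} L^p \\
&= 1 - (\rho + \zeta_1) L - (\zeta_2 - \zeta_1) L^2 - (\zeta_3 - \zeta_2) L^3 - \cdots \\
&\qquad - (\zeta_{p-1} - \zeta_{p-2}) L^{p-1} - (-\zeta_{p-1}) L^p \\
&= 1 - [(\phi_1 + \phi_2 + \cdots + \phi_p) - (\phi_2 + \phi_3 + \cdots + \phi_p)] L \\
&\qquad - [-(\phi_3 + \phi_4 + \cdots + \phi_p) + (\phi_2 + \phi_3 + \cdots + \phi_p)] L^2 - \cdots \\
&\qquad - [-(\phi_p) + (\phi_{p-1} + \phi_p)] L^{p-1} - (\phi_p) L^p \\
&= 1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_{p-1} L^{p-1} - \phi_p L^p.
\end{aligned} \tag{17.7.4}
\]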


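Thus, the autoregression [17.7.1] can equivalently be written

\[
\{(1 - \rho L) - (\zeta_1 L + \zeta_2 L^2 + \cdots + \zeta_{p-1} L^{p-1})(1 - L)\} y_t = \varepsilon_t \tag{17.7.5}
\]

or

\[
y_t = \rho y_{t-1} + \zeta_1 \Delta y_{t-1} + \zeta_2 \Delta y_{t-2} + \cdots + \zeta_{p-1} \Delta y_{t-p+1} + \varepsilon_t. \tag{17.7.6}
\]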
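The equivalence of the two lag polynomials can be verified numerically for any particular set of autoregressive coefficients. In the sketch below the helper names and the example values of $\phi_1, \ldots, \phi_4$ are made up for illustration; polynomial multiplication by $(1 - L)$ is carried out with `numpy.convolve` on coefficient vectors ordered by ascending powers of $L$.

```python
import numpy as np

def adf_params_from_phi(phi):
    """Map (phi_1..phi_p) to rho = sum(phi) and zeta_j = -(phi_{j+1}+...+phi_p), j = 1..p-1."""
    phi = np.asarray(phi, dtype=float)
    rho = phi.sum()
    zeta = np.array([-phi[j:].sum() for j in range(1, len(phi))])
    return rho, zeta

def lag_poly_from_adf(rho, zeta):
    """Coefficients (ascending powers of L) of
    (1 - rho L) - (zeta_1 L + ... + zeta_{p-1} L^{p-1})(1 - L)."""
    p = len(zeta) + 1
    first = np.zeros(p + 1)
    first[0], first[1] = 1.0, -rho
    zeta_poly = np.concatenate(([0.0], zeta))        # zeta_1 L + ... + zeta_{p-1} L^{p-1}
    second = np.convolve(zeta_poly, [1.0, -1.0])     # multiply by (1 - L); length p + 1
    return first - second

phi = [0.5, 0.3, -0.2, 0.1]                          # illustrative AR(4) coefficients
rho, zeta = adf_params_from_phi(phi)
original = np.concatenate(([1.0], -np.asarray(phi))) # 1 - phi_1 L - ... - phi_p L^p
print(rho, zeta)
print(np.allclose(lag_poly_from_adf(rho, zeta), original))
```

For these values $\rho = 0.7$ and $(\zeta_1, \zeta_2, \zeta_3) = (-0.2, 0.1, -0.1)$, and the reconstructed polynomial matches the original coefficient by coefficient.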


Suppose that the process that generated $y_t$ contains a single unit root; that
is, suppose one root of

\[
(1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_p z^p) = 0 \tag{17.7.7}
\]

is unity,

\[
1 - \phi_1 - \phi_2 - \cdots - \phi_p = 0, \tag{17.7.8}
\]

and all other roots of [17.7.7] are outside the unit circle. Notice that [17.7.8] implies
that the coefficient $\rho$ in [17.7.2] is unity. Moreover, when $\rho = 1$, expression [17.7.4]
would imply

\[
(1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_p z^p)
= (1 - \zeta_1 z - \zeta_2 z^2 - \cdots - \zeta_{p-1} z^{p-1})(1 - z). \tag{17.7.9}
\]

Of the $p$ values of $z$ that make the left side of [17.7.9] zero, one is $z = 1$ and all
other roots are presumed to be outside the unit circle. The same must be true of
the right side as well, meaning that all roots of

\[
(1 - \zeta_1 z - \zeta_2 z^2 - \cdots - \zeta_{p-1} z^{p-1}) = 0
\]




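lie outside the unit circle. Under the null hypothesis that $\rho = 1$, expression [17.7.5]
could then be written as

\[
(1 - \zeta_1 L - \zeta_2 L^2 - \cdots - \zeta_{p-1} L^{p-1}) \Delta y_t = \varepsilon_t,
\]

or

\[
\Delta y_t = u_t, \tag{17.7.10}
\]

where

\[
u_t = (1 - \zeta_1 L - \zeta_2 L^2 - \cdots - \zeta_{p-1} L^{p-1})^{-1} \varepsilon_t.
\]

Equation [17.7.10] indicates that $y_t$ behaves like the variable $\xi_t$ described in Proposition
17.3, with

\[
\psi(L) = (1 - \zeta_1 L - \zeta_2 L^2 - \cdots - \zeta_{p-1} L^{p-1})^{-1}.
\]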


One of the advantages of writing the autoregression of [17.7.1] in the equivalent
form of [17.7.6] is that only one of the regressors in [17.7.6], namely, $y_{t-1}$,
is $I(1)$, whereas all of the other regressors $(\Delta y_{t-1}, \Delta y_{t-2}, \ldots, \Delta y_{t-p+1})$ are
stationary. Thus, [17.7.6] is the Sims, Stock, and Watson (1990) canonical form,
originally proposed for this problem by Fuller (1976). Since no knowledge of
any population parameters is needed to write the model in this canonical form,
in this case it is convenient to estimate the parameters by direct OLS estimation of
[17.7.6].

Results generalizing those for case 1 in Section 17.4 are obtained when the 
regression is estimated as written in [17.7.6] without a constant term. Cases 2 and 
3 are generalized by including a constant term in [17.7.6], while case 4 is generalized 
by including a constant term and a time trend in [17.7.6]. For illustration, the case 
2 regression is discussed in detail. Comparable results for case 1, case 3, and case 
4 will be summarized in Table 17.3 later in this section, with details developed in 
exercises at the end of the chapter. 


Case 2. The Estimated Autoregression Includes a Constant 
Term, but the Data Were Really Generated by a Unit Root 
Autoregression with No Drift 


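Following the usual notational convention for OLS estimation of autoregressions,
we assume that the initial sample is of size $T + p$, with observations numbered
$\{y_{-p+1}, y_{-p+2}, \ldots, y_T\}$, and condition on the first $p$ observations. We are interested
in the properties of OLS estimation of

\[
\begin{aligned}
y_t &= \zeta_1 \Delta y_{t-1} + \zeta_2 \Delta y_{t-2} + \cdots + \zeta_{p-1} \Delta y_{t-p+1} + \alpha + \rho y_{t-1} + \varepsilon_t \\
&\equiv \mathbf{x}_t' \boldsymbol{\beta} + \varepsilon_t,
\end{aligned} \tag{17.7.11}
\]

where $\boldsymbol{\beta} \equiv (\zeta_1, \zeta_2, \ldots, \zeta_{p-1}, \alpha, \rho)'$ and
$\mathbf{x}_t \equiv (\Delta y_{t-1}, \Delta y_{t-2}, \ldots, \Delta y_{t-p+1}, 1, y_{t-1})'$.
The deviation of the OLS estimate $\mathbf{b}_T$ from the true value $\boldsymbol{\beta}$ is given by

\[
\mathbf{b}_T - \boldsymbol{\beta} = \left[\sum_{t=1}^{T} \mathbf{x}_t \mathbf{x}_t'\right]^{-1} \left[\sum_{t=1}^{T} \mathbf{x}_t \varepsilon_t\right]. \tag{17.7.12}
\]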


r= 




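Letting $u_t \equiv y_t - y_{t-1}$, the individual terms in [17.7.12] are

\[
\sum_{t=1}^{T} \mathbf{x}_t \mathbf{x}_t' =
\begin{bmatrix}
\Sigma u_{t-1}^2 & \Sigma u_{t-1} u_{t-2} & \cdots & \Sigma u_{t-1} u_{t-p+1} & \Sigma u_{t-1} & \Sigma u_{t-1} y_{t-1} \\
\Sigma u_{t-2} u_{t-1} & \Sigma u_{t-2}^2 & \cdots & \Sigma u_{t-2} u_{t-p+1} & \Sigma u_{t-2} & \Sigma u_{t-2} y_{t-1} \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\
\Sigma u_{t-p+1} u_{t-1} & \Sigma u_{t-p+1} u_{t-2} & \cdots & \Sigma u_{t-p+1}^2 & \Sigma u_{t-p+1} & \Sigma u_{t-p+1} y_{t-1} \\
\Sigma u_{t-1} & \Sigma u_{t-2} & \cdots & \Sigma u_{t-p+1} & T & \Sigma y_{t-1} \\
\Sigma y_{t-1} u_{t-1} & \Sigma y_{t-1} u_{t-2} & \cdots & \Sigma y_{t-1} u_{t-p+1} & \Sigma y_{t-1} & \Sigma y_{t-1}^2
\end{bmatrix} \tag{17.7.13}
\]

\[
\sum_{t=1}^{T} \mathbf{x}_t \varepsilon_t =
\begin{bmatrix}
\Sigma u_{t-1} \varepsilon_t \\
\Sigma u_{t-2} \varepsilon_t \\
\vdots \\
\Sigma u_{t-p+1} \varepsilon_t \\
\Sigma \varepsilon_t \\
\Sigma y_{t-1} \varepsilon_t
\end{bmatrix}, \tag{17.7.14}
\]

with $\Sigma$ denoting summation over $t = 1, 2, \ldots, T$.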

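Under the null hypothesis that $\alpha = 0$ and $\rho = 1$, we saw in [17.7.10] that $y_t$
behaves like $\xi_t \equiv u_1 + u_2 + \cdots + u_t$ in Proposition 17.3. Consulting the rates
of convergence in Proposition 17.3, for this case the scaling matrix should be

\[
\underset{[(p+1) \times (p+1)]}{\boldsymbol{\Upsilon}_T} \equiv
\begin{bmatrix}
\sqrt{T} & 0 & \cdots & 0 & 0 \\
0 & \sqrt{T} & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & \sqrt{T} & 0 \\
0 & 0 & \cdots & 0 & T
\end{bmatrix}. \tag{17.7.15}
\]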

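Premultiplying [17.7.12] by $\boldsymbol{\Upsilon}_T$ as in [17.4.20] results in

\[
\boldsymbol{\Upsilon}_T(\mathbf{b}_T - \boldsymbol{\beta}) =
\left\{\boldsymbol{\Upsilon}_T^{-1}\left[\sum_{t=1}^{T} \mathbf{x}_t \mathbf{x}_t'\right]\boldsymbol{\Upsilon}_T^{-1}\right\}^{-1}
\left\{\boldsymbol{\Upsilon}_T^{-1}\left[\sum_{t=1}^{T} \mathbf{x}_t \varepsilon_t\right]\right\}. \tag{17.7.16}
\]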


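Consider the matrix $\boldsymbol{\Upsilon}_T^{-1} \Sigma \mathbf{x}_t \mathbf{x}_t' \boldsymbol{\Upsilon}_T^{-1}$. Elements in the upper left $(p \times p)$ block of
$\Sigma \mathbf{x}_t \mathbf{x}_t'$ are divided by $T$, the first $p$ elements of the $(p + 1)$th row or $(p + 1)$th
column are divided by $T^{3/2}$, and the row $(p + 1)$, column $(p + 1)$ element of
$\Sigma \mathbf{x}_t \mathbf{x}_t'$ is divided by $T^2$. Moreover,

\[
\begin{aligned}
T^{-1} \Sigma u_{t-i} u_{t-j} &\xrightarrow{p} \gamma_{|i-j|} && \text{from result (c) of Proposition 17.3} \\
T^{-1} \Sigma u_{t-j} &\xrightarrow{p} E(u_{t-j}) = 0 && \text{from the law of large numbers} \\
T^{-3/2} \Sigma y_{t-1} u_{t-j} &\xrightarrow{p} 0 && \text{from Proposition 17.3(e)} \\
T^{-3/2} \Sigma y_{t-1} &\xrightarrow{L} \lambda \cdot \int W(r)\, dr && \text{from Proposition 17.3(f)} \\
T^{-2} \Sigma y_{t-1}^2 &\xrightarrow{L} \lambda^2 \cdot \int [W(r)]^2\, dr && \text{from Proposition 17.3(h),}
\end{aligned}
\]

where

\[
\begin{aligned}
\gamma_j &\equiv E\{(\Delta y_t)(\Delta y_{t-j})\} \\
\lambda &\equiv \sigma \cdot \psi(1) = \sigma/(1 - \zeta_1 - \zeta_2 - \cdots - \zeta_{p-1}) \\
\sigma^2 &\equiv E(\varepsilon_t^2)
\end{aligned} \tag{17.7.17}
\]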



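and the integral sign denotes integration over $r$ from 0 to 1. Thus,

\[
\boldsymbol{\Upsilon}_T^{-1}[\Sigma \mathbf{x}_t \mathbf{x}_t']\boldsymbol{\Upsilon}_T^{-1} \xrightarrow{L}
\begin{bmatrix}
\gamma_0 & \gamma_1 & \cdots & \gamma_{p-2} & 0 & 0 \\
\gamma_1 & \gamma_0 & \cdots & \gamma_{p-3} & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\
\gamma_{p-2} & \gamma_{p-3} & \cdots & \gamma_0 & 0 & 0 \\
0 & 0 & \cdots & 0 & 1 & \lambda \int W(r)\, dr \\
0 & 0 & \cdots & 0 & \lambda \int W(r)\, dr & \lambda^2 \int [W(r)]^2\, dr
\end{bmatrix}
= \begin{bmatrix} \mathbf{V} & \mathbf{0} \\ \mathbf{0} & \mathbf{Q} \end{bmatrix}, \tag{17.7.18}
\]

where

\[
\mathbf{V} \equiv
\begin{bmatrix}
\gamma_0 & \gamma_1 & \cdots & \gamma_{p-2} \\
\gamma_1 & \gamma_0 & \cdots & \gamma_{p-3} \\
\vdots & \vdots & \ddots & \vdots \\
\gamma_{p-2} & \gamma_{p-3} & \cdots & \gamma_0
\end{bmatrix} \tag{17.7.19}
\]

\[
\mathbf{Q} \equiv
\begin{bmatrix}
1 & \lambda \int W(r)\, dr \\
\lambda \int W(r)\, dr & \lambda^2 \int [W(r)]^2\, dr
\end{bmatrix}. \tag{17.7.20}
\]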
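Next, consider the second term in [17.7.16],

\[
\boldsymbol{\Upsilon}_T^{-1}[\Sigma \mathbf{x}_t \varepsilon_t] =
\begin{bmatrix}
T^{-1/2} \Sigma u_{t-1} \varepsilon_t \\
T^{-1/2} \Sigma u_{t-2} \varepsilon_t \\
\vdots \\
T^{-1/2} \Sigma u_{t-p+1} \varepsilon_t \\
T^{-1/2} \Sigma \varepsilon_t \\
T^{-1} \Sigma y_{t-1} \varepsilon_t
\end{bmatrix}. \tag{17.7.21}
\]

The first $p - 1$ elements of this vector are $\sqrt{T}$ times the sample mean of a martingale
difference sequence whose variance-covariance matrix is

\[
E\left\{
\begin{bmatrix} u_{t-1} \varepsilon_t \\ u_{t-2} \varepsilon_t \\ \vdots \\ u_{t-p+1} \varepsilon_t \end{bmatrix}
\begin{bmatrix} u_{t-1} \varepsilon_t & u_{t-2} \varepsilon_t & \cdots & u_{t-p+1} \varepsilon_t \end{bmatrix}
\right\}
= \sigma^2
\begin{bmatrix}
\gamma_0 & \gamma_1 & \cdots & \gamma_{p-2} \\
\gamma_1 & \gamma_0 & \cdots & \gamma_{p-3} \\
\vdots & \vdots & \ddots & \vdots \\
\gamma_{p-2} & \gamma_{p-3} & \cdots & \gamma_0
\end{bmatrix}
= \sigma^2 \mathbf{V}. \tag{17.7.22}
\]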




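Thus, the first $p - 1$ terms in [17.7.21] satisfy the usual central limit theorem,

\[
\begin{bmatrix}
T^{-1/2} \Sigma u_{t-1} \varepsilon_t \\
T^{-1/2} \Sigma u_{t-2} \varepsilon_t \\
\vdots \\
T^{-1/2} \Sigma u_{t-p+1} \varepsilon_t
\end{bmatrix}
\xrightarrow{L} \mathbf{h}_1 \sim N(\mathbf{0}, \sigma^2 \mathbf{V}). \tag{17.7.23}
\]

The distribution of the last two elements in [17.7.21] can be obtained from results
(a) and (d) of Proposition 17.3:

\[
\begin{bmatrix}
T^{-1/2} \Sigma \varepsilon_t \\
T^{-1} \Sigma y_{t-1} \varepsilon_t
\end{bmatrix}
\xrightarrow{L} \mathbf{h}_2 \sim
\begin{bmatrix}
\sigma \cdot W(1) \\
\tfrac{1}{2} \sigma \lambda \cdot \{[W(1)]^2 - 1\}
\end{bmatrix}. \tag{17.7.24}
\]

Substituting [17.7.18] through [17.7.24] into [17.7.16] results in

\[
\boldsymbol{\Upsilon}_T(\mathbf{b}_T - \boldsymbol{\beta}) \xrightarrow{L}
\begin{bmatrix} \mathbf{V} & \mathbf{0} \\ \mathbf{0} & \mathbf{Q} \end{bmatrix}^{-1}
\begin{bmatrix} \mathbf{h}_1 \\ \mathbf{h}_2 \end{bmatrix}
= \begin{bmatrix} \mathbf{V}^{-1} \mathbf{h}_1 \\ \mathbf{Q}^{-1} \mathbf{h}_2 \end{bmatrix}. \tag{17.7.25}
\]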


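Coefficients on $\Delta y_{t-j}$


The first $p - 1$ elements of $\boldsymbol{\beta}$ are $\zeta_1, \zeta_2, \ldots, \zeta_{p-1}$, which are the coefficients
on zero-mean stationary regressors $(\Delta y_{t-1}, \Delta y_{t-2}, \ldots, \Delta y_{t-p+1})$. The block consisting
of the first $p - 1$ elements in [17.7.25] states that

\[
\sqrt{T}
\begin{bmatrix}
\hat{\zeta}_{1,T} - \zeta_1 \\
\hat{\zeta}_{2,T} - \zeta_2 \\
\vdots \\
\hat{\zeta}_{p-1,T} - \zeta_{p-1}
\end{bmatrix}
\xrightarrow{L} \mathbf{V}^{-1} \mathbf{h}_1. \tag{17.7.26}
\]

Recalling from [17.7.23] that $\mathbf{h}_1 \sim N(\mathbf{0}, \sigma^2 \mathbf{V})$, it follows that
$\mathbf{V}^{-1} \mathbf{h}_1 \sim N(\mathbf{0}, \sigma^2 \mathbf{V}^{-1})$, or

\[
\sqrt{T}
\begin{bmatrix}
\hat{\zeta}_{1,T} - \zeta_1 \\
\hat{\zeta}_{2,T} - \zeta_2 \\
\vdots \\
\hat{\zeta}_{p-1,T} - \zeta_{p-1}
\end{bmatrix}
\xrightarrow{L} N\left(
\begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix},\;
\sigma^2
\begin{bmatrix}
\gamma_0 & \gamma_1 & \cdots & \gamma_{p-2} \\
\gamma_1 & \gamma_0 & \cdots & \gamma_{p-3} \\
\vdots & \vdots & \ddots & \vdots \\
\gamma_{p-2} & \gamma_{p-3} & \cdots & \gamma_0
\end{bmatrix}^{-1}
\right), \tag{17.7.27}
\]

where $\gamma_j = E\{(\Delta y_t)(\Delta y_{t-j})\}$.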

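This means that a null hypothesis involving the coefficients on the stationary
regressors $(\zeta_1, \zeta_2, \ldots, \zeta_{p-1})$ in [17.7.11] can be tested in the usual way, with the
standard $t$ and $F$ statistics asymptotically valid. To see this, suppose that the null
hypothesis is $H_0$: $\mathbf{R}\boldsymbol{\beta} = \mathbf{r}$ for $\mathbf{R}$ a known $[m \times (p + 1)]$ matrix where $m$ is the
number of restrictions. The Wald form of the OLS $\chi^2$ test [8.2.23] is given by

\[
\begin{aligned}
\chi_T^2 &= (\mathbf{R}\mathbf{b}_T - \mathbf{r})' \left\{ s_T^2\, \mathbf{R} \left[\sum_{t=1}^{T} \mathbf{x}_t \mathbf{x}_t'\right]^{-1} \mathbf{R}' \right\}^{-1} (\mathbf{R}\mathbf{b}_T - \mathbf{r}) \\
&= [\mathbf{R} \sqrt{T}(\mathbf{b}_T - \boldsymbol{\beta})]' \left\{ s_T^2\, \mathbf{R} \cdot \sqrt{T} \left[\sum_{t=1}^{T} \mathbf{x}_t \mathbf{x}_t'\right]^{-1} \sqrt{T} \cdot \mathbf{R}' \right\}^{-1}
[\mathbf{R} \sqrt{T}(\mathbf{b}_T - \boldsymbol{\beta})],
\end{aligned} \tag{17.7.28}
\]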




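where

\[
\begin{aligned}
s_T^2 &= [T - (p + 1)]^{-1} \sum_{t=1}^{T} (y_t - \hat{\zeta}_{1,T} \Delta y_{t-1} - \hat{\zeta}_{2,T} \Delta y_{t-2} - \cdots
- \hat{\zeta}_{p-1,T} \Delta y_{t-p+1} - \hat{\alpha}_T - \hat{\rho}_T y_{t-1})^2 \\
&\xrightarrow{p} E(\varepsilon_t^2) = \sigma^2.
\end{aligned} \tag{17.7.29}
\]

If none of the restrictions involves $\alpha$ or $\rho$, then the last two columns of $\mathbf{R}$ contain
all zeros:

\[
\underset{[m \times (p+1)]}{\mathbf{R}} = \begin{bmatrix} \underset{[m \times (p-1)]}{\mathbf{R}_1} & \underset{(m \times 2)}{\mathbf{0}} \end{bmatrix}. \tag{17.7.30}
\]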


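In this case, $\mathbf{R}\sqrt{T} = \mathbf{R}\boldsymbol{\Upsilon}_T$ for $\boldsymbol{\Upsilon}_T$ the matrix in [17.7.15], so that [17.7.28] can be
written as

\[
\chi_T^2 = [\mathbf{R}\boldsymbol{\Upsilon}_T(\mathbf{b}_T - \boldsymbol{\beta})]' \left\{ s_T^2\, \mathbf{R}\boldsymbol{\Upsilon}_T \left[\sum_{t=1}^{T} \mathbf{x}_t \mathbf{x}_t'\right]^{-1} \boldsymbol{\Upsilon}_T \mathbf{R}' \right\}^{-1} [\mathbf{R}\boldsymbol{\Upsilon}_T(\mathbf{b}_T - \boldsymbol{\beta})].
\]

From [17.7.18], [17.7.25], [17.7.29], and [17.7.30], this converges to

\[
\begin{aligned}
\chi_T^2 &\xrightarrow{L}
\left( \begin{bmatrix} \mathbf{R}_1 & \mathbf{0} \end{bmatrix} \begin{bmatrix} \mathbf{V}^{-1}\mathbf{h}_1 \\ \mathbf{Q}^{-1}\mathbf{h}_2 \end{bmatrix} \right)'
\left\{ \sigma^2 \begin{bmatrix} \mathbf{R}_1 & \mathbf{0} \end{bmatrix}
\begin{bmatrix} \mathbf{V}^{-1} & \mathbf{0} \\ \mathbf{0} & \mathbf{Q}^{-1} \end{bmatrix}
\begin{bmatrix} \mathbf{R}_1' \\ \mathbf{0}' \end{bmatrix} \right\}^{-1}
\left( \begin{bmatrix} \mathbf{R}_1 & \mathbf{0} \end{bmatrix} \begin{bmatrix} \mathbf{V}^{-1}\mathbf{h}_1 \\ \mathbf{Q}^{-1}\mathbf{h}_2 \end{bmatrix} \right) \\
&= [\mathbf{R}_1 \mathbf{V}^{-1} \mathbf{h}_1]' [\sigma^2 \mathbf{R}_1 \mathbf{V}^{-1} \mathbf{R}_1']^{-1} [\mathbf{R}_1 \mathbf{V}^{-1} \mathbf{h}_1].
\end{aligned} \tag{17.7.31}
\]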


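But since $\mathbf{h}_1 \sim N(\mathbf{0}, \sigma^2 \mathbf{V})$, it follows that the $(m \times 1)$ vector $[\mathbf{R}_1 \mathbf{V}^{-1} \mathbf{h}_1]$ is distributed
$N(\mathbf{0}, [\sigma^2 \mathbf{R}_1 \mathbf{V}^{-1} \mathbf{R}_1'])$. Hence, expression [17.7.31] is a quadratic form in a Gaussian
vector that satisfies the conditions of Proposition 8.1:

\[
\chi_T^2 \xrightarrow{L} \chi^2(m).
\]

This verifies that the usual $t$ or $F$ tests applied to any subset of the coefficients
$\hat{\zeta}_1, \hat{\zeta}_2, \ldots, \hat{\zeta}_{p-1}$ have the standard limiting distributions.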
Note, moreover, that [17.7.27] is exactly the same asymptotic distribution
that would be obtained if the data were differenced before estimating the autoregression:

\[
\Delta y_t = \zeta_1 \Delta y_{t-1} + \zeta_2 \Delta y_{t-2} + \cdots + \zeta_{p-1} \Delta y_{t-p+1} + \alpha + \varepsilon_t.
\]

Thus, if the goal is to estimate $\zeta_1, \zeta_2, \ldots, \zeta_{p-1}$ or test hypotheses about these
coefficients, there is no need based on asymptotic distribution theory for differencing
the data before estimating the autoregression. Many researchers do recommend
differencing the data first, but the reason is to reduce the small-sample
bias and small-sample mean squared errors of the estimates, not to change the
asymptotic distribution.


Coefficients on Constant Term and $y_{t-1}$


The last two elements of $\boldsymbol{\beta}$ are $\alpha$ and $\rho$, which are coefficients on the constant
term and the $I(1)$ regressor, $y_{t-1}$. From [17.7.25], [17.7.20], and [17.7.24], their



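limiting distribution is given by

\[
\begin{aligned}
\begin{bmatrix} T^{1/2} \hat{\alpha}_T \\ T(\hat{\rho}_T - 1) \end{bmatrix}
&\xrightarrow{L}
\begin{bmatrix} 1 & \lambda \int W(r)\, dr \\ \lambda \int W(r)\, dr & \lambda^2 \int [W(r)]^2\, dr \end{bmatrix}^{-1}
\begin{bmatrix} \sigma \cdot W(1) \\ \tfrac{1}{2} \sigma \lambda \{[W(1)]^2 - 1\} \end{bmatrix} \\
&= \sigma
\begin{bmatrix} 1 & 0 \\ 0 & \lambda \end{bmatrix}^{-1}
\begin{bmatrix} 1 & \int W(r)\, dr \\ \int W(r)\, dr & \int [W(r)]^2\, dr \end{bmatrix}^{-1}
\begin{bmatrix} 1 & 0 \\ 0 & \lambda \end{bmatrix}^{-1}
\begin{bmatrix} 1 & 0 \\ 0 & \lambda \end{bmatrix}
\begin{bmatrix} W(1) \\ \tfrac{1}{2}\{[W(1)]^2 - 1\} \end{bmatrix} \\
&= \begin{bmatrix} \sigma & 0 \\ 0 & \sigma/\lambda \end{bmatrix}
\begin{bmatrix} 1 & \int W(r)\, dr \\ \int W(r)\, dr & \int [W(r)]^2\, dr \end{bmatrix}^{-1}
\begin{bmatrix} W(1) \\ \tfrac{1}{2}\{[W(1)]^2 - 1\} \end{bmatrix}.
\end{aligned} \tag{17.7.32}
\]

The second element of this vector implies that $(\lambda/\sigma)$ times $T(\hat{\rho}_T - 1)$ has the same
asymptotic distribution as [17.4.28], which described the estimate of $\rho$ in a regression
without lagged $\Delta y$ and with serially uncorrelated disturbances:

\[
T \cdot (\lambda/\sigma) \cdot (\hat{\rho}_T - 1) \xrightarrow{L}
\frac{\tfrac{1}{2}\{[W(1)]^2 - 1\} - W(1) \cdot \int W(r)\, dr}{\int [W(r)]^2\, dr - \left[\int W(r)\, dr\right]^2}. \tag{17.7.33}
\]

Recall from [17.7.17] that

\[
\lambda/\sigma = (1 - \zeta_1 - \zeta_2 - \cdots - \zeta_{p-1})^{-1}. \tag{17.7.34}
\]

This magnitude is clearly estimated consistently by

\[
(1 - \hat{\zeta}_{1,T} - \hat{\zeta}_{2,T} - \cdots - \hat{\zeta}_{p-1,T})^{-1},
\]

where $\hat{\zeta}_{j,T}$ denotes the estimate of $\zeta_j$ based on the OLS regression [17.7.11]. Thus,
the generalization of the Dickey-Fuller $\rho$ test when lagged changes in $y$ are included
in the regression is

\[
\frac{T(\hat{\rho}_T - 1)}{1 - \hat{\zeta}_{1,T} - \hat{\zeta}_{2,T} - \cdots - \hat{\zeta}_{p-1,T}}
\xrightarrow{L}
\frac{\tfrac{1}{2}\{[W(1)]^2 - 1\} - W(1) \cdot \int W(r)\, dr}{\int [W(r)]^2\, dr - \left[\int W(r)\, dr\right]^2}. \tag{17.7.35}
\]

This is to be compared with the case 2 section of Table B.5.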
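Consider next an OLS $t$ test of the null hypothesis that $\rho = 1$:

\[
t_T = \frac{(\hat{\rho}_T - 1)}{\{s_T^2 \cdot \mathbf{e}_{p+1}' (\Sigma \mathbf{x}_t \mathbf{x}_t')^{-1} \mathbf{e}_{p+1}\}^{1/2}}, \tag{17.7.36}
\]

where $\mathbf{e}_{p+1}$ denotes a $[(p + 1) \times 1]$ vector with unity in the last position and zeros
elsewhere. Multiplying the numerator and denominator of [17.7.36] by $T$ results in

\[
t_T = \frac{T(\hat{\rho}_T - 1)}{\{s_T^2 \cdot \mathbf{e}_{p+1}' \boldsymbol{\Upsilon}_T (\Sigma \mathbf{x}_t \mathbf{x}_t')^{-1} \boldsymbol{\Upsilon}_T \mathbf{e}_{p+1}\}^{1/2}}. \tag{17.7.37}
\]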


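But

\[
\begin{aligned}
\mathbf{e}_{p+1}' \boldsymbol{\Upsilon}_T (\Sigma \mathbf{x}_t \mathbf{x}_t')^{-1} \boldsymbol{\Upsilon}_T \mathbf{e}_{p+1}
&= \mathbf{e}_{p+1}' \{\boldsymbol{\Upsilon}_T^{-1} (\Sigma \mathbf{x}_t \mathbf{x}_t') \boldsymbol{\Upsilon}_T^{-1}\}^{-1} \mathbf{e}_{p+1} \\
&\xrightarrow{L} \mathbf{e}_{p+1}' \begin{bmatrix} \mathbf{V}^{-1} & \mathbf{0} \\ \mathbf{0} & \mathbf{Q}^{-1} \end{bmatrix} \mathbf{e}_{p+1} \\
&= \frac{1}{\lambda^2 \left\{ \int [W(r)]^2\, dr - \left[\int W(r)\, dr\right]^2 \right\}}
\end{aligned}
\]

by virtue of [17.7.18] and [17.7.20]. Hence, from [17.7.37] and [17.7.33],

\[
\begin{aligned}
t_T &\xrightarrow{L} (\sigma/\lambda)\,
\frac{\tfrac{1}{2}\{[W(1)]^2 - 1\} - W(1) \cdot \int W(r)\, dr}{\int [W(r)]^2\, dr - \left[\int W(r)\, dr\right]^2}
\div \left\{ \frac{\sigma^2}{\lambda^2 \left\{ \int [W(r)]^2\, dr - \left[\int W(r)\, dr\right]^2 \right\}} \right\}^{1/2} \\
&= \frac{\tfrac{1}{2}\{[W(1)]^2 - 1\} - W(1) \cdot \int W(r)\, dr}{\left\{ \int [W(r)]^2\, dr - \left[\int W(r)\, dr\right]^2 \right\}^{1/2}}.
\end{aligned} \tag{17.7.38}
\]

This is the same distribution as in [17.4.36]. Thus, the usual $t$ test of $\rho = 1$ for
OLS estimation of [17.7.11] can be compared with the case 2 section of Table B.6
without any corrections for the fact that lagged values of $\Delta y$ are included in the
regression.

A similar result applies to the Dickey-Fuller $F$ test of the joint hypothesis
that $\alpha = 0$ and $\rho = 1$. This null hypothesis can be represented as $\mathbf{R}\boldsymbol{\beta} = \mathbf{r}$, where

\[
\underset{[2 \times (p+1)]}{\mathbf{R}} = \begin{bmatrix} \underset{[2 \times (p-1)]}{\mathbf{0}} & \underset{(2 \times 2)}{\mathbf{I}_2} \end{bmatrix}
\]

and $\mathbf{r} = (0, 1)'$. The $F$ test is then

\[
F_T = (\mathbf{b}_T - \boldsymbol{\beta})' \mathbf{R}' \{ s_T^2 \cdot \mathbf{R} (\Sigma \mathbf{x}_t \mathbf{x}_t')^{-1} \mathbf{R}' \}^{-1} \mathbf{R}(\mathbf{b}_T - \boldsymbol{\beta})/2. \tag{17.7.39}
\]

Define $\tilde{\boldsymbol{\Upsilon}}_T$ to be the following $(2 \times 2)$ matrix:

\[
\tilde{\boldsymbol{\Upsilon}}_T \equiv \begin{bmatrix} T^{1/2} & 0 \\ 0 & T \end{bmatrix}. \tag{17.7.40}
\]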



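Notice that [17.7.39] can be written

\[
F_T = (\mathbf{b}_T - \boldsymbol{\beta})' \mathbf{R}' \tilde{\boldsymbol{\Upsilon}}_T \{ s_T^2 \cdot \tilde{\boldsymbol{\Upsilon}}_T \mathbf{R} (\Sigma \mathbf{x}_t \mathbf{x}_t')^{-1} \mathbf{R}' \tilde{\boldsymbol{\Upsilon}}_T \}^{-1}
\times \tilde{\boldsymbol{\Upsilon}}_T \mathbf{R}(\mathbf{b}_T - \boldsymbol{\beta})/2. \tag{17.7.41}
\]

The matrix in [17.7.40] has the property that

\[
\tilde{\boldsymbol{\Upsilon}}_T \mathbf{R} = \mathbf{R} \boldsymbol{\Upsilon}_T
\]

for $\mathbf{R} = [\mathbf{0} \;\; \mathbf{I}_2]$ and $\boldsymbol{\Upsilon}_T$ the $(p + 1) \times (p + 1)$ matrix in [17.7.15]. From [17.7.25],
$\mathbf{R}\boldsymbol{\Upsilon}_T(\mathbf{b}_T - \boldsymbol{\beta}) \xrightarrow{L} \mathbf{Q}^{-1}\mathbf{h}_2$. Thus, [17.7.41] implies that

\[
\begin{aligned}
F_T &= (\mathbf{b}_T - \boldsymbol{\beta})' (\mathbf{R}\boldsymbol{\Upsilon}_T)' \{ s_T^2 \cdot \mathbf{R}\boldsymbol{\Upsilon}_T (\Sigma \mathbf{x}_t \mathbf{x}_t')^{-1} \boldsymbol{\Upsilon}_T \mathbf{R}' \}^{-1} \mathbf{R}\boldsymbol{\Upsilon}_T(\mathbf{b}_T - \boldsymbol{\beta})/2 \\
&\xrightarrow{L} (\mathbf{Q}^{-1}\mathbf{h}_2)' \{\sigma^2 \mathbf{Q}^{-1}\}^{-1} (\mathbf{Q}^{-1}\mathbf{h}_2)/2 \\
&= \mathbf{h}_2' \mathbf{Q}^{-1} \mathbf{h}_2 / (2\sigma^2) \\
&= \frac{1}{2\sigma^2}
\begin{bmatrix} \sigma W(1) & \tfrac{1}{2}\sigma\lambda\{[W(1)]^2 - 1\} \end{bmatrix}
\begin{bmatrix} 1 & \lambda \int W(r)\, dr \\ \lambda \int W(r)\, dr & \lambda^2 \int [W(r)]^2\, dr \end{bmatrix}^{-1}
\begin{bmatrix} \sigma W(1) \\ \tfrac{1}{2}\sigma\lambda\{[W(1)]^2 - 1\} \end{bmatrix} \\
&= \frac{\sigma^2}{2\sigma^2}
\begin{bmatrix} W(1) & \tfrac{1}{2}\{[W(1)]^2 - 1\} \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ 0 & \lambda \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ 0 & \lambda \end{bmatrix}^{-1}
\begin{bmatrix} 1 & \int W(r)\, dr \\ \int W(r)\, dr & \int [W(r)]^2\, dr \end{bmatrix}^{-1} \\
&\qquad \times
\begin{bmatrix} 1 & 0 \\ 0 & \lambda \end{bmatrix}^{-1}
\begin{bmatrix} 1 & 0 \\ 0 & \lambda \end{bmatrix}
\begin{bmatrix} W(1) \\ \tfrac{1}{2}\{[W(1)]^2 - 1\} \end{bmatrix} \\
&= \tfrac{1}{2}
\begin{bmatrix} W(1) & \tfrac{1}{2}\{[W(1)]^2 - 1\} \end{bmatrix}
\begin{bmatrix} 1 & \int W(r)\, dr \\ \int W(r)\, dr & \int [W(r)]^2\, dr \end{bmatrix}^{-1}
\begin{bmatrix} W(1) \\ \tfrac{1}{2}\{[W(1)]^2 - 1\} \end{bmatrix}.
\end{aligned} \tag{17.7.42}
\]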

This is identical to the asymptotic distribution of the F test when the regression 
does not include lagged Ay and the disturbances are i.i.d. Thus, the F statistic in 
[17.7.41] based on OLS estimation of [17.7.11] can be compared with the case 2 
section of Table B.7 without corrections. 

Finally, consider a hypothesis test involving a restriction¹² across $\zeta_1, \zeta_2, \ldots, \zeta_{p-1}$ and $\rho$,

\[
r_1 \zeta_1 + r_2 \zeta_2 + \cdots + r_{p-1} \zeta_{p-1} + 0 \cdot \alpha + r_{p+1} \rho = r
\]

or

\[
\mathbf{r}' \boldsymbol{\beta} = r. \tag{17.7.43}
\]

¹²Since the maintained assumption is that $\rho = 1$, this is a slightly unnatural way to write a hypothesis.
Nevertheless, framing the hypothesis this way will shortly prove useful in deriving the asymptotic
distribution of an autoregression estimated in the usual form without the Dickey-Fuller transformation.
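The distribution of the $t$ test of this hypothesis will be dominated asymptotically
by the parameters with the slowest rate of convergence, namely, $\hat{\zeta}_1, \hat{\zeta}_2, \ldots, \hat{\zeta}_{p-1}$.
Since these are asymptotically Gaussian, the test statistic is asymptotically Gaussian
and so can be compared with the usual $t$ tables. To demonstrate this formally, note
that the usual $t$ statistic for testing this hypothesis is

\[
t_T = \frac{\mathbf{r}' \mathbf{b}_T - r}{\{s_T^2\, \mathbf{r}' (\Sigma \mathbf{x}_t \mathbf{x}_t')^{-1} \mathbf{r}\}^{1/2}}
= \frac{T^{1/2}(\mathbf{r}' \mathbf{b}_T - r)}{\{s_T^2\, T^{1/2} \mathbf{r}' (\Sigma \mathbf{x}_t \mathbf{x}_t')^{-1} \mathbf{r}\, T^{1/2}\}^{1/2}}. \tag{17.7.44}
\]

Define $\tilde{\mathbf{r}}_T$ to be the vector that results when the last element of $\mathbf{r}$ is replaced by
$r_{p+1}/\sqrt{T}$,

\[
\tilde{\mathbf{r}}_T' \equiv \begin{bmatrix} r_1 & r_2 & \cdots & r_{p-1} & 0 & r_{p+1}/\sqrt{T} \end{bmatrix}, \tag{17.7.45}
\]

and notice that

\[
T^{1/2} \mathbf{r} = \boldsymbol{\Upsilon}_T \tilde{\mathbf{r}}_T \tag{17.7.46}
\]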


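for $\boldsymbol{\Upsilon}_T$ the matrix in [17.7.15]. Using [17.7.46] and the null hypothesis that
$r = \mathbf{r}' \boldsymbol{\beta}$, expression [17.7.44] can be written

\[
t_T = \frac{\tilde{\mathbf{r}}_T' \boldsymbol{\Upsilon}_T (\mathbf{b}_T - \boldsymbol{\beta})}{\{s_T^2\, \tilde{\mathbf{r}}_T' \boldsymbol{\Upsilon}_T (\Sigma \mathbf{x}_t \mathbf{x}_t')^{-1} \boldsymbol{\Upsilon}_T \tilde{\mathbf{r}}_T\}^{1/2}}. \tag{17.7.47}
\]

Notice from [17.7.45] that

\[
\lim_{T \to \infty} \tilde{\mathbf{r}}_T = \tilde{\mathbf{r}},
\]

where

\[
\tilde{\mathbf{r}}' \equiv \begin{bmatrix} r_1 & r_2 & \cdots & r_{p-1} & 0 & 0 \end{bmatrix}.
\]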


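Using this result along with [17.7.18] and [17.7.25] in [17.7.47] produces

\[
\begin{aligned}
t_T &\xrightarrow{L}
\frac{\tilde{\mathbf{r}}' \begin{bmatrix} \mathbf{V}^{-1}\mathbf{h}_1 \\ \mathbf{Q}^{-1}\mathbf{h}_2 \end{bmatrix}}
{\left\{ \sigma^2\, \tilde{\mathbf{r}}' \begin{bmatrix} \mathbf{V}^{-1} & \mathbf{0} \\ \mathbf{0} & \mathbf{Q}^{-1} \end{bmatrix} \tilde{\mathbf{r}} \right\}^{1/2}} \\
&= \frac{\begin{bmatrix} r_1 & r_2 & \cdots & r_{p-1} \end{bmatrix} \mathbf{V}^{-1} \mathbf{h}_1}
{\{\sigma^2 \begin{bmatrix} r_1 & r_2 & \cdots & r_{p-1} \end{bmatrix} \mathbf{V}^{-1} \begin{bmatrix} r_1 & r_2 & \cdots & r_{p-1} \end{bmatrix}'\}^{1/2}}.
\end{aligned} \tag{17.7.48}
\]

Since $\mathbf{h}_1 \sim N(\mathbf{0}, \sigma^2 \mathbf{V})$, it follows that

\[
\begin{bmatrix} r_1 & r_2 & \cdots & r_{p-1} \end{bmatrix} \mathbf{V}^{-1} \mathbf{h}_1 \sim N(0, h),
\]

where

\[
h \equiv \sigma^2 \begin{bmatrix} r_1 & r_2 & \cdots & r_{p-1} \end{bmatrix} \mathbf{V}^{-1} \begin{bmatrix} r_1 & r_2 & \cdots & r_{p-1} \end{bmatrix}'.
\]

Thus, the limiting distribution in [17.7.48] is that of a Gaussian scalar divided by
its standard deviation and is therefore $N(0, 1)$. This confirms the claim that the $t$
test of $\mathbf{r}' \boldsymbol{\beta} = r$ can be compared with the usual $t$ tables.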




One interesting implication of this last result concerns the asymptotic properties
of the estimated coefficients if the autoregression is estimated in the usual
levels form rather than the transformed regression [17.7.11]. Thus, suppose that
the following specification is estimated by OLS:

\[
y_t = \alpha + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t \tag{17.7.49}
\]

for some $p \geq 2$. Recalling [17.7.2] and [17.7.3], the relation between the estimates
$(\hat{\zeta}_1, \hat{\zeta}_2, \ldots, \hat{\zeta}_{p-1}, \hat{\rho})$ investigated previously and estimates $(\hat{\phi}_1, \hat{\phi}_2, \ldots, \hat{\phi}_p)$
based on OLS estimation of [17.7.49] is

\[
\begin{aligned}
\hat{\phi}_p &= -\hat{\zeta}_{p-1} \\
\hat{\phi}_j &= \hat{\zeta}_j - \hat{\zeta}_{j-1} \quad \text{for } j = 2, 3, \ldots, p - 1 \\
\hat{\phi}_1 &= \hat{\rho} + \hat{\zeta}_1.
\end{aligned}
\]

Thus, each of the coefficients $\hat{\phi}_1, \hat{\phi}_2, \ldots, \hat{\phi}_p$ is a linear combination of the elements
of $(\hat{\zeta}_1, \hat{\zeta}_2, \ldots, \hat{\zeta}_{p-1}, \hat{\rho})$. The analysis of [17.7.43] establishes that any individual
estimate $\hat{\phi}_j$ converges at rate $\sqrt{T}$ to a Gaussian random variable. Recalling the
discussion of [16.3.20] and [16.3.21], an OLS $t$ or $F$ test based on [17.7.49] is
numerically identical to the equivalent $t$ or $F$ test expressed in terms of the representation
in [17.7.11]. Thus, the usual $t$ tests associated with hypotheses about
any individual coefficients $\hat{\phi}_1, \hat{\phi}_2, \ldots, \hat{\phi}_p$ in [17.7.49] can be compared with
standard $t$ or $N(0, 1)$ tables. Indeed, any hypothesis about linear combinations of
the $\phi$'s other than the sum $\phi_1 + \phi_2 + \cdots + \phi_p$ satisfies the standard conditions.
The sum $\hat{\phi}_1 + \hat{\phi}_2 + \cdots + \hat{\phi}_p$, of course, has the nonstandard distribution of the
estimate $\hat{\rho}$ described in [17.7.33].
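The numerical equivalence between the levels regression and the transformed regression can be illustrated on simulated data. The sketch below, which assumes a driftless Gaussian random walk, $p = 3$, and an arbitrary seed, estimates both forms with `numpy.linalg.lstsq` and checks that the levels coefficients equal the linear combinations of $(\hat{\zeta}_1, \ldots, \hat{\zeta}_{p-1}, \hat{\rho})$ given above. The variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
p, T = 3, 300
y = np.cumsum(rng.standard_normal(T + p))    # simulated random walk; first p values are presample

t_idx = np.arange(p, T + p)                  # estimation sample, conditioning on first p observations
lags = np.column_stack([y[t_idx - j] for j in range(1, p + 1)])                  # y_{t-1},...,y_{t-p}
dlags = np.column_stack([y[t_idx - j] - y[t_idx - j - 1] for j in range(1, p)])  # dy_{t-1},...,dy_{t-p+1}
ones = np.ones((T, 1))

X_levels = np.hstack([ones, lags])                        # regressors of the levels form
X_adf = np.hstack([dlags, ones, y[t_idx - 1][:, None]])   # regressors of the transformed form
b_levels = np.linalg.lstsq(X_levels, y[t_idx], rcond=None)[0]
b_adf = np.linalg.lstsq(X_adf, y[t_idx], rcond=None)[0]

phi_hat = b_levels[1:]
zeta_hat, rho_hat = b_adf[:p - 1], b_adf[-1]

# Implied levels coefficients: phi_1 = rho + zeta_1,
# phi_j = zeta_j - zeta_{j-1} for j = 2..p-1, phi_p = -zeta_{p-1}.
phi_implied = np.empty(p)
phi_implied[0] = rho_hat + zeta_hat[0]
phi_implied[1:p - 1] = zeta_hat[1:] - zeta_hat[:-1]
phi_implied[p - 1] = -zeta_hat[-1]

print(np.allclose(phi_hat, phi_implied, atol=1e-6))
print(phi_hat.sum(), rho_hat)    # the sum of the phi estimates equals rho_hat
```

Because the two regressor sets span the same column space, the fitted values and residuals are identical, which is why the associated $t$ and $F$ statistics coincide.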


Summary of Asymptotic Results for an Estimated 
Autoregression That Includes a Constant Term 


The preceding analysis applies to OLS estimation of

\[
y_t = \zeta_1 \Delta y_{t-1} + \zeta_2 \Delta y_{t-2} + \cdots + \zeta_{p-1} \Delta y_{t-p+1} + \alpha + \rho y_{t-1} + \varepsilon_t
\]

under the assumption that the true value of $\alpha$ is zero and the true value of $\rho$ is 1.
The other maintained assumptions were that $\varepsilon_t$ is i.i.d. with mean zero, variance
$\sigma^2$, and finite fourth moment and that roots of

\[
(1 - \zeta_1 z - \zeta_2 z^2 - \cdots - \zeta_{p-1} z^{p-1}) = 0
\]

are outside the unit circle. It was seen that the estimates $\hat{\zeta}_1, \hat{\zeta}_2, \ldots, \hat{\zeta}_{p-1}$ converge
at rate $\sqrt{T}$ to Gaussian variates, and standard $t$ or $F$ tests for hypotheses about
these coefficients have the usual limiting Gaussian or $\chi^2$ distributions. The estimates
$\hat{\alpha}$ and $\hat{\rho}$ converge at rates $\sqrt{T}$ and $T$, respectively, to nonstandard distributions.
If the difference between the OLS estimate $\hat{\rho}$ and the hypothesized true value of
unity is multiplied by the sample size and divided by $(1 - \hat{\zeta}_1 - \hat{\zeta}_2 - \cdots - \hat{\zeta}_{p-1})$,
the resulting statistic has the same asymptotic distribution as the variable tabulated
in the case 2 section of Table B.5. The usual $t$ statistic of the hypothesis $\rho = 1$
does not need to be adjusted for sample size or serial correlation and has the same
asymptotic distribution as the variable tabulated in the case 2 section of Table B.6.
The usual $F$ statistic of the joint hypothesis $\alpha = 0$ and $\rho = 1$ likewise does not
have to be adjusted for sample size or serial correlation and has the same distribution
as the variable tabulated in the case 2 section of Table B.7.

When the autoregression includes lagged changes as here, tests for a unit root 
based on the value of p, ttests, or F tests are described as augmented Dickey-Fuller 
tests. 
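A minimal sketch of how the case 2 augmented Dickey-Fuller $\rho$ and $t$ statistics might be computed by OLS is given below. The function name `adf_case2`, the returned dictionary, and the simulated input are assumptions for illustration; the $F$ statistic and the critical values of Tables B.5 and B.6 are not included.

```python
import numpy as np

def adf_case2(y, p):
    """OLS of y_t on (dy_{t-1},...,dy_{t-p+1}, 1, y_{t-1}), conditioning on the first p observations.

    Returns the coefficient vector b = (zeta_1..zeta_{p-1}, alpha, rho), the residual
    variance s2, the statistic T*(rho_hat - 1)/(1 - sum(zeta_hat)), and the OLS t
    statistic for rho = 1.
    """
    y = np.asarray(y, dtype=float)
    t_idx = np.arange(p, y.shape[0])
    T = t_idx.shape[0]
    dlags = [y[t_idx - j] - y[t_idx - j - 1] for j in range(1, p)]
    X = np.column_stack(dlags + [np.ones(T), y[t_idx - 1]])
    b, *_ = np.linalg.lstsq(X, y[t_idx], rcond=None)
    resid = y[t_idx] - X @ b
    s2 = resid @ resid / (T - (p + 1))            # degrees-of-freedom correction as in s_T^2
    cov = s2 * np.linalg.inv(X.T @ X)
    zeta_hat, rho_hat = b[:p - 1], b[-1]
    z_df = T * (rho_hat - 1.0) / (1.0 - zeta_hat.sum())
    t_stat = (rho_hat - 1.0) / np.sqrt(cov[-1, -1])
    return {"b": b, "s2": s2, "Z_DF": z_df, "t": t_stat, "T": T}

# Illustrative use on a simulated driftless random walk.
rng = np.random.default_rng(2)
y_sim = np.cumsum(rng.standard_normal(205))
result = adf_case2(y_sim, p=5)
print(result["Z_DF"], result["t"])
```

Under the null hypothesis the two printed statistics would be compared with the case 2 sections of Tables B.5 and B.6, respectively, with no further correction for serial correlation.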




Example 17.8 
The following model was estimated by OLS for the interest rate data described
in Example 17.3 (standard errors in parentheses):

\[
\begin{aligned}
i_t = {} & \underset{(0.0788)}{0.335}\, \Delta i_{t-1} - \underset{(0.0808)}{0.388}\, \Delta i_{t-2} + \underset{(0.0800)}{0.276}\, \Delta i_{t-3} \\
& - \underset{(0.0794)}{0.107}\, \Delta i_{t-4} + \underset{(0.109)}{0.195} + \underset{(0.018604)}{0.96904}\, i_{t-1}.
\end{aligned}
\]

Dates $t$ = 1948:II through 1989:I were used for estimation, so in this case the
sample size is $T = 164$. For these estimates, the augmented Dickey-Fuller $\rho$
test [17.7.35] would be

\[
\frac{164}{1 - 0.335 + 0.388 - 0.276 + 0.107}\,(0.96904 - 1) = -5.74.
\]

Since $-5.74 > -13.8$, the null hypothesis that the Treasury bill rate follows
a fifth-order autoregression with no constant term, and a single unit root, is
accepted at the 5% level. The OLS $t$ test for this same hypothesis is

\[
(0.96904 - 1)/(0.018604) = -1.66.
\]

Since $-1.66 > -2.89$, the null hypothesis of a unit root is accepted by the
augmented Dickey-Fuller $t$ test as well. Finally, the OLS $F$ test of the joint
null hypothesis that $\rho = 1$ and $\alpha = 0$ is 1.65. Since this is less than 4.68, the
null hypothesis is again accepted.

The null hypothesis that the autoregression in levels requires only four
lags is based on the OLS $t$ test of $\zeta_4 = 0$:

\[
-0.107/0.0794 = -1.35.
\]

From Table B.3, the 5% two-sided critical value for a $t$ variable with 158 degrees
of freedom is $-1.98$. Since $-1.35 > -1.98$, the null hypothesis that only four
lags are needed for the autoregression in levels is accepted.
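The three statistics in this example follow directly from the reported coefficients and standard errors, as the short calculation below shows. Only the numbers quoted in the example are used; the variable names are illustrative, and the $F$ statistic is omitted because it cannot be recovered from the reported output alone.

```python
T = 164
zeta_hat = [0.335, -0.388, 0.276, -0.107]     # coefficients on the lagged changes
se_zeta = [0.0788, 0.0808, 0.0800, 0.0794]    # their OLS standard errors
rho_hat, se_rho = 0.96904, 0.018604

adf_rho = T * (rho_hat - 1.0) / (1.0 - sum(zeta_hat))   # augmented Dickey-Fuller rho statistic
adf_t = (rho_hat - 1.0) / se_rho                        # OLS t statistic for rho = 1
t_lag4 = zeta_hat[3] / se_zeta[3]                       # OLS t statistic for zeta_4 = 0

print(round(adf_rho, 2), round(adf_t, 2), round(t_lag4, 2))
```

The first two values are compared with the case 2 critical values of Tables B.5 and B.6, while the last is compared with the standard $t$ tables.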


Asymptotic Results for Other Autoregressions 


Up to this point in this section, we have considered an autoregression that is
a generalization of case 2 of Section 17.4: a constant is included in the estimated
regression, though the population process is presumed to exhibit no drift. Parallel
generalizations for cases 1, 3, and 4 can be obtained in the same fashion. The
reader is invited to derive these generalizations in exercises at the end of the chapter.
The key results are summarized in Table 17.3.


TABLE 17.3 
Summary of Asymptotic Results for Autoregressions Containing a Unit Root 


Case 1: 
Estimated regression: 
Y, = (AY. + GAy-2 + °° + G&-AW-pa1 + PM-1 + & 
True process: same specification as estimated regression with p = 1 


Any ¢ or F test involving ¢,, g, . - - , §,-1 can be compared with the usual ¢ 
or F tables for an asymptotically valid test. 


528 Chapter 17 | Univariate Processes with Unit Roots 


TABLE 17.3 (continued) 


Zor has the same asymptotic distribution as the variable described under the 
heading Case 1 in Table B.5. 

OLS t test of p = 1 has the same asymptotic distribution as the variable 
described under Case 1 in Table B.6. 
Case 2: 

Estimated regression: 

y= g,Ay,-1 + f2Ay,_2 cee ae by 1 AYi—p+1 tat PY; -1 + €, 

True process: same specification as estimated regression with a = 0 and 
p=1 

Any t or F test involving Z,, 2, . . . , g,-, can be compared with the usual ¢ 
or F tables for an asymptotically valid test. 

Zoe has the same asymptotic distribution as the variable described under Case 
2 in Table B.S. 

OLS t test of p = 1 has the same asymptotic distribution as the variable 
described under Case 2 in Table B.6. 

OLS F test of joint hypothesis that a = 0 and p = 1 has the same asymptotic 
distribution as the variable described under Case 2 in Table B.7. 


Case 3:

Estimated regression:

$y_t = \zeta_1 \Delta y_{t-1} + \zeta_2 \Delta y_{t-2} + \cdots + \zeta_{p-1} \Delta y_{t-p+1} + \alpha + \rho y_{t-1} + \varepsilon_t$

True process: same specification as estimated regression with $\alpha \neq 0$ and
$\rho = 1$

$\hat{\rho}_T$ converges at rate $T^{3/2}$ to a Gaussian variable; all other estimated
coefficients converge at rate $T^{1/2}$ to Gaussian variables.

Any $t$ or $F$ test involving any coefficients from the regression can be compared
with the usual $t$ or $F$ tables for an asymptotically valid test.


Case 4:

Estimated regression:

$y_t = \zeta_1 \Delta y_{t-1} + \zeta_2 \Delta y_{t-2} + \cdots + \zeta_{p-1} \Delta y_{t-p+1} + \alpha + \rho y_{t-1} + \delta t + \varepsilon_t$

True process: same specification as estimated regression with $\alpha$ any value,
$\rho = 1$, and $\delta = 0$

Any $t$ or $F$ test involving $\zeta_1, \zeta_2, \ldots, \zeta_{p-1}$ can be compared with the usual $t$
or $F$ tables for an asymptotically valid test.

$Z_{DF}$ has the same asymptotic distribution as the variable described under Case
4 in Table B.5.

OLS $t$ test of $\rho = 1$ has the same asymptotic distribution as the variable
described under Case 4 in Table B.6.

OLS $F$ test of joint hypothesis that $\rho = 1$ and $\delta = 0$ has the same asymptotic
distribution as the variable described under Case 4 in Table B.7.
Notes to Table 17.3

Estimated regression indicates the form in which the regression is estimated, using observations
$t = 1, 2, \ldots, T$ and conditioning on observations $t = 0, -1, \ldots, -p + 1$.

True process describes the null hypothesis under which the distribution is calculated. In each
case it is assumed that roots of

$$(1 - \zeta_1 z - \zeta_2 z^2 - \cdots - \zeta_{p-1} z^{p-1}) = 0$$

are all outside the unit circle and that $\varepsilon_t$ is i.i.d. with mean zero, variance $\sigma^2$, and finite fourth moment.

$Z_{DF}$ in each case is the following statistic:

$$Z_{DF} \equiv T(\hat{\rho}_T - 1)/(1 - \hat{\zeta}_{1,T} - \hat{\zeta}_{2,T} - \cdots - \hat{\zeta}_{p-1,T}),$$

where $\hat{\rho}_T, \hat{\zeta}_{1,T}, \hat{\zeta}_{2,T}, \ldots, \hat{\zeta}_{p-1,T}$ are the OLS estimates from the indicated regression.

OLS $t$ test of $\rho = 1$ is $(\hat{\rho}_T - 1)/\hat{\sigma}_{\hat{\rho}_T}$, where $\hat{\sigma}_{\hat{\rho}_T}$ is the OLS standard error of $\hat{\rho}_T$.

OLS $F$ test of a hypothesis involving two restrictions is given by expression [17.7.39].


17.7. Asymptotic Properties of a pth-Order Autoregression 529 


Example 17.9 
The following autoregression was estimated by OLS for the GNP data in Figure 
17.3 (standard errors in parentheses): 


$$
\begin{aligned}
y_t = {} & \underset{(0.0777)}{0.329}\,\Delta y_{t-1} + \underset{(0.0813)}{0.209}\,\Delta y_{t-2} - \underset{(0.0818)}{0.084}\,\Delta y_{t-3} \\
& - \underset{(0.0788)}{0.075}\,\Delta y_{t-4} + \underset{(13.57)}{35.92} + \underset{(0.019386)}{0.94969}\,y_{t-1} + \underset{(0.0152)}{0.0378}\,t.
\end{aligned}
$$

Here, $T = 164$ and the augmented Dickey-Fuller $\rho$ test is

$$\frac{164}{1 - 0.329 - 0.209 + 0.084 + 0.075}\,(0.94969 - 1) = -13.3.$$

Since $-13.3 > -21.0$, the null hypothesis that the log of GNP is ARIMA(4,
1, 0) with possible drift is accepted at the 5% level. The augmented Dickey-
Fuller $t$ test also accepts this hypothesis:

$$(0.94969 - 1)/(0.019386) = -2.60 > -3.44.$$

The OLS $F$ test of the joint null hypothesis that $\rho = 1$ and $\delta = 0$ is $3.74 <
6.42$, and so the augmented Dickey-Fuller $F$ test is also consistent with the unit
root specification.
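The arithmetic in Example 17.9 can be checked directly from the reported estimates. The following Python sketch is only illustrative: the function names `adf_rho_stat` and `adf_t_stat` are hypothetical helpers introduced here, and the critical values -21.0 and -3.44 are simply the ones quoted in the example rather than values computed by the code.

```python
def adf_rho_stat(T, rho_hat, zeta_hats):
    """Z_DF = T(rho_hat - 1) / (1 - sum of zeta_hats), as in the notes to Table 17.3."""
    return T * (rho_hat - 1.0) / (1.0 - sum(zeta_hats))


def adf_t_stat(rho_hat, se_rho):
    """OLS t statistic for the hypothesis rho = 1."""
    return (rho_hat - 1.0) / se_rho


# Estimates reported in Example 17.9 (coefficients on the four lagged changes).
zeta_hats = [0.329, 0.209, -0.084, -0.075]
rho_hat, se_rho, T = 0.94969, 0.019386, 164

z_df = adf_rho_stat(T, rho_hat, zeta_hats)
t_df = adf_t_stat(rho_hat, se_rho)

# The unit root null is rejected only if the statistic falls below the
# critical value quoted in the example (5% level, case 4).
reject_by_rho_test = z_df < -21.0
reject_by_t_test = t_df < -3.44

print(round(z_df, 1), round(t_df, 2), reject_by_rho_test, reject_by_t_test)
```

Both statistics lie above their critical values, which is the sense in which the example says the unit root null is accepted.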


Unit Root AR(p) Processes with p Unknown 


Various suggestions have been proposed for how to proceed when the process
is regarded as ARIMA($p$, 1, 0) with $p$ unknown but finite. One simple approach
is to estimate [17.7.11] with $p$ taken to be some prespecified upper bound $\bar{p}$. The
OLS $t$ test of $\zeta_{\bar{p}-1} = 0$ can then be compared with the usual critical value for a $t$
statistic from Table B.3. If the null hypothesis is accepted, the OLS $F$ test of the
joint null hypothesis that both $\zeta_{\bar{p}-1} = 0$ and $\zeta_{\bar{p}-2} = 0$ can be compared with the
usual $F(2, T - k)$ distribution in Table B.4. The procedure continues sequentially
until the joint null hypothesis that $\zeta_{\bar{p}-1} = 0, \zeta_{\bar{p}-2} = 0, \ldots, \zeta_{\bar{p}-\ell} = 0$ is rejected
for some $\ell$. The recommended regression is then

$$y_t = \zeta_1 \Delta y_{t-1} + \zeta_2 \Delta y_{t-2} + \cdots + \zeta_{\bar{p}-\ell} \Delta y_{t-\bar{p}+\ell} + \alpha + \rho y_{t-1} + \varepsilon_t.$$

If no value of $\ell$ leads to rejection, the simple Dickey-Fuller test of Table 17.1 is
used. Hall (1991) discussed a variety of alternative strategies for estimating $p$.

Just as in the Phillips-Perron consideration of the MA($\infty$) case, the researcher
might want to choose bigger values for $p$, the autoregressive lag length, the larger
is the sample size $T$. Said and Dickey (1984) showed that as long as $p$ goes to
infinity sufficiently slowly relative to $T$, then the OLS $t$ test of $\rho = 1$ can continue
to be compared with the Dickey-Fuller values in Table B.6.

Again, it is worthwhile to keep in mind that there always exists a $p$ such
that an ARIMA($p$, 1, 0) representation can describe a stationary process arbitrarily
well for a given sample. The Said-Dickey test of $\rho = 1$ might therefore
best be viewed as follows. For a given fixed $p$, we can certainly ask whether an
ARIMA($p - 1$, 1, 0) describes the data nearly as well as an ARIMA($p$, 0, 0).
Imposing $\rho = 1$ when the true value of $\rho$ is close to unity may improve forecasts
and small-sample estimates of the other parameters. The Said-Dickey result permits
the researcher to use a larger value of $p$ on which to base this comparison the
larger is the sample size $T$.




17.8. Other Approaches to Testing for Unit Roots 


This section briefly describes some alternative approaches to testing for unit roots. 


Variance Ratio Tests 
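Let

$$\Delta y_t = \alpha + u_t,$$

where

$$u_t = \sum_{j=0}^{\infty} \psi_j \varepsilon_{t-j} \equiv \psi(L)\varepsilon_t$$

for $\varepsilon_t$ a white noise sequence with variance $\sigma^2$. Recall from expression [15.3.10]
that the permanent effect of $\varepsilon_t$ on the level of $y_{t+s}$ is given by

$$\lim_{s \to \infty} \frac{\partial y_{t+s}}{\partial \varepsilon_t} = \psi(1).$$

If $y_t$ is stationary or stationary around a deterministic time trend, an innovation $\varepsilon_t$
has no permanent effect on $y$, requiring $\psi(1) = 0$.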

Cochrane (1988) and Lo and MacKinlay (1988) proposed a test for unit roots
that exploits this property. Consider the change in $y$ over $s$ periods,

$$y_{t+s} - y_t = \alpha s + u_{t+s} + u_{t+s-1} + \cdots + u_{t+1}, \qquad [17.8.1]$$

and notice that

$$(y_{t+s} - y_t)/s = \alpha + s^{-1}(u_{t+s} + u_{t+s-1} + \cdots + u_{t+1}). \qquad [17.8.2]$$

The second term in [17.8.2] could be viewed as the sample mean of $s$ observations
drawn from the process followed by $u$. Thus, Proposition 7.5(b) and result [7.2.8]
imply that

$$\lim_{s \to \infty} s \cdot \mathrm{Var}[s^{-1}(u_{t+s} + u_{t+s-1} + \cdots + u_{t+1})] = \sigma^2 [\psi(1)]^2. \qquad [17.8.3]$$


Let $\hat{\alpha}_T$ denote the average change in $y$ in a sample of $T$ observations:

$$\hat{\alpha}_T = T^{-1} \sum_{t=1}^{T} (y_t - y_{t-1}).$$

Consider the following estimate of the variance of the change in $y$ over its value $s$
periods earlier:

$$\hat{J}_T(s) = T^{-1} \sum_{t=0}^{T-s} (y_{t+s} - y_t - \hat{\alpha}_T s)^2. \qquad [17.8.4]$$

This should converge in probability to

$$J(s) = E(y_{t+s} - y_t - \alpha s)^2 = E(u_{t+s} + u_{t+s-1} + \cdots + u_{t+1})^2 \qquad [17.8.5]$$

as the sample size $T$ becomes large. Comparing this expression with [17.8.3],

$$\lim_{s \to \infty} s^{-1} \cdot J(s) = \sigma^2 \cdot [\psi(1)]^2.$$


Cochrane (1988) therefore proposed calculating [17.8.4] as a function of $s$. If
the true process for $y_t$ is stationary or stationary around a deterministic time trend,
this statistic should go to zero for large $s$. If the true process for $y_t$ is $I(1)$, this
statistic gives a measure of the quantitative importance of permanent effects of $\varepsilon$
as reflected in the long-run multiplier $\psi(1)$. However, the statistic in [17.8.4] is
not reliable unless $s$ is much smaller than $T$.

If the data truly followed a random walk so that $\psi(L) = 1$, then $J(s)$ in
[17.8.5] would equal $s \cdot \sigma^2$ for any $s$, where $\sigma^2$ is the variance of $u_t$. Lo and MacKinlay
(1988) exploited this property to suggest tests of the random walk hypothesis based
on alternative values of $s$. See Lo and MacKinlay (1989) and Cecchetti and Lam
(1991) for evidence on the small-sample properties of these tests.
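The estimate in [17.8.4] is simple to compute. The sketch below is a minimal illustration using only the Python standard library; the function names `j_hat` and `simulate_random_walk`, the seed, and the simulated parameter values are assumptions made here for illustration, and the summation range follows [17.8.4] as stated above (data $y_0, y_1, \ldots, y_T$). For a simulated random walk with drift, the scaled values `j_hat(y, s) / s` should all be near $\sigma^2$, in line with the property $J(s) = s \cdot \sigma^2$.

```python
import random


def j_hat(y, s):
    """Estimate [17.8.4] from data y = [y_0, y_1, ..., y_T]."""
    T = len(y) - 1
    # Average change in y; the sum of first differences telescopes.
    alpha_hat = (y[-1] - y[0]) / T
    total = sum((y[t + s] - y[t] - alpha_hat * s) ** 2 for t in range(T - s + 1))
    return total / T


def simulate_random_walk(T, alpha, sigma, seed=0):
    """Simulate y_t = alpha + y_{t-1} + u_t with Gaussian u_t and y_0 = 0."""
    rng = random.Random(seed)
    y = [0.0]
    for _ in range(T):
        y.append(y[-1] + alpha + rng.gauss(0.0, sigma))
    return y


# Illustrative parameter values (assumptions, not taken from the text).
y_sim = simulate_random_walk(T=5000, alpha=0.1, sigma=1.0)
for s in (1, 2, 4, 8):
    # Under a random walk each scaled value should be close to sigma**2 = 1.
    print(s, j_hat(y_sim, s) / s)
```

As the text cautions, the estimate becomes unreliable when `s` is not small relative to `T`, so only small values of `s` are used in the loop.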


Other Tests for Unit Roots 


The Phillips-Perron approach was based on an MA($\infty$) representation for $\Delta y_t$,
while the Said-Dickey approach was based on an AR($\infty$) representation. Tests based
on a finite ARMA($p$, $q$) representation for $\Delta y_t$ have been explored by Said and
Dickey (1985), Hall (1989), Said (1991), and Pantula and Hall (1991).

A number of other approaches to testing for unit roots have been proposed, 
including Sargan and Bhargava (1983), Solo (1984), Bhargava (1986), Dickey and 
Pantula (1987), Park and Choi (1988), Schmidt and Phillips (1992), Stock (1991), 
and Kwiatkowski, Phillips, Schmidt, and Shin (1992). See Stock (1993) for an 
excellent survey. Asymptotic inference for processes with near unit root behavior 
has been discussed by Chan and Wei (1987), Phillips (1988), and Sowell (1990). 


17.9. Bayesian Analysis and Unit Roots 


Up to this point in the chapter we have adopted a classical statistical perspective,
calculating the distribution of $\hat{\rho}$ conditional on a particular value of $\rho$ such as $\rho = 1$.
This section considers the Bayesian perspective, in which the true value of $\rho$ is
regarded as a random variable and the goal is to describe the distribution of this
random variable conditional on the data.

Recall from Proposition 12.3 that if the prior density for the vector of unknown
coefficients $\beta$ and innovation precision $\sigma^{-2}$ is of the Normal-gamma form of [12.1.19]
and [12.1.20], then the posterior distribution of $\beta$ conditional on the data is multivariate
$t$. This result holds exactly for any finite sample and holds regardless of
whether the process is stationary. Thus, in the case of the diffuse prior distribution
represented by $N = \lambda = 0$ and $M^{-1} = 0$, a Bayesian would essentially use the
usual $t$ and $F$ statistics in the standard way.

How can the classical distribution of $\hat{\rho}$ be strongly skewed while the Bayesian
distribution of $\rho$ is that of a symmetric $t$ variable? Sims (1988) and Sims and Uhlig
(1991) provided a detailed discussion of this question. The classical test of the null
hypothesis $\rho = 1$ is based only on the distribution of $\hat{\rho}$ when the true value of $\rho$
is unity. By contrast, the Bayesian inference is based on the distribution of $\hat{\rho}|\rho$ for
all the possible values of $\rho$, with the distribution of $\hat{\rho}|\rho$ weighted according to the
prior probability for $\rho$. If the distribution of $\hat{\rho}|\rho$ had the same skew and dispersion
for every $\rho$ as it does at $\rho = 1$, then we would conclude that, having observed any
particular $\hat{\rho}$, the true value of $\rho$ is probably somewhat higher. However, the distribution
of $\hat{\rho}|\rho$ changes with $\rho$: the lower the true value of $\rho$, the smaller the
skew and the greater the dispersion, since from [17.1.3] the variance of $\sqrt{T}(\hat{\rho} - \rho)$
is approximately $(1 - \rho^2)$. Because lower values of $\rho$ imply greater dispersion for
$\hat{\rho}$, in the absence of skew we would suspect that a given observation $\hat{\rho} = 0.95$ was
more likely to have been generated by a distribution centered at $\rho = 0.90$ with
large dispersion than by a distribution centered at $\rho = 1$ with small dispersion.
The effects of skew and dispersion turn out to cancel, so that with a uniform prior
distribution for the value of $\rho$, having observed $\hat{\rho} = 0.95$, it is just as likely that
the true value of $\rho$ is greater than 0.95 as that the true value of $\rho$ is less than 0.95.


Example 17.10
For the GNP regression in Example 17.9, the probability that $\rho \geq 1$ conditional
on the data is the probability that a $t$ variable with $T = 164$ degrees of freedom¹³
exceeds $(1 - 0.94969)/0.019386 = 2.60$. From Table B.3, this probability is
around 0.005. Hence, although the value of $\rho$ must be large, it is unlikely to
be as big as unity.
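The tail probability in Example 17.10 can be approximated numerically instead of being read from a table. The following sketch uses only the Python standard library and integrates the Student $t$ density with Simpson's rule; the function names `student_t_pdf` and `student_t_upper_tail` and the number of integration steps are choices made here for illustration, not part of the text.

```python
import math


def student_t_pdf(x, nu):
    """Density of a Student t variable with nu degrees of freedom."""
    log_const = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                 - 0.5 * math.log(nu * math.pi))
    return math.exp(log_const - (nu + 1) / 2 * math.log1p(x * x / nu))


def student_t_upper_tail(x, nu, n=2000):
    """P(t_nu > x) for x >= 0: one half minus the Simpson integral over [0, x]."""
    if x < 0 or n % 2:
        raise ValueError("need x >= 0 and an even number of steps")
    h = x / n
    total = student_t_pdf(0.0, nu) + student_t_pdf(x, nu)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * student_t_pdf(i * h, nu)
    return 0.5 - total * h / 3


# Point estimate and standard error from Example 17.9; T = 164 degrees of freedom.
stat = (1 - 0.94969) / 0.019386
prob = student_t_upper_tail(stat, 164)
print(round(stat, 2), prob)
```

The computed probability should be close to the value of roughly 0.005 quoted in the example.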


The contrast between the Bayesian inference in Example 17.10 and the classical
inference in Example 17.9 is one of the reasons given by Sims (1988) and
Sims and Uhlig (1991) for preferring Bayesian methods. Note that the probability
calculated in Example 17.10 will be less than 0.025 if and only if a classical 95%
confidence interval around the point estimate $\hat{\rho}$ does not contain unity. Thus, an
alternative way of describing the finding of Example 17.10 is that the standard
asymptotic classical confidence region around $\hat{\rho}$ does not include $\rho = 1$. Even so,
Example 17.9 showed that the null hypothesis of a unit root is accepted by the
augmented Dickey-Fuller test. The classical asymptotic confidence region centered
at $\rho = \hat{\rho}$ seems inconsistent with a unit root, while the classical asymptotic confidence
region centered at $\rho = 1$ supports a unit root. Such disconnected confidence
regions resulting from the classical approach may seem somewhat troublesome and
counterintuitive.¹⁴ By contrast, the Bayesian has a single, consistent summary of
the plausibility of different values of $\rho$, which is that implied by the posterior
distribution of $\rho$ conditional on the data.

One could, of course, use a prior distribution that reflected more confidence
in the prior information about the value of $\rho$. As long as the prior distribution was
in the Normal-gamma class, this would cause us to shift the point estimate 0.94969
in the direction of the prior mean and reduce the standard error and increase the
degrees of freedom as warranted by the prior information, but a $t$ distribution
would still be used to interpret the resulting statistic.

Although the Normal-gamma class is convenient to work with, it might not
be sufficiently flexible to reflect the researcher's true prior beliefs. Sims (1988, p.
470) discussed Bayesian inference in which a point mass with positive probability
is placed on the possibility that $\rho = 1$. DeJong and Whiteman (1991) used numerical
methods to calculate posterior distributions under a range of prior distributions
defined numerically and concluded that the evidence for unit roots in many key
economic time series is quite weak.

Phillips (1991a) noted that there is a prior distribution for which the Bayesian
inference mimics the classical approach. He argued that the diffuse prior distribution
of Proposition 12.3 is actually highly informative in a time series regression
and suggested instead a prior distribution due to Jeffreys (1946). Although this
prior distribution has some theoretical arguments on its behalf, it has the unusual
property in this application that the prior distribution is a function of the sample
size $T$: Phillips would propose using a different prior distribution for $f(\rho)$ when


¹³ Recall from Proposition 12.3(b) that the degrees of freedom are given by $N^* = N + T$. Thus,
the Bayesian interpretation is not quite identical to the classical $t$ statistic, whose degrees of freedom
would be $T - k$.

¹⁴ Stock (1991) has recently proposed a solution to this problem from the classical perspective.
Another approach is to rely on the exact small-sample distribution, as advocated by Andrews (1993).




the analyst is going to obtain a sample of size 50 than when the analyst is going to
obtain a sample of size 100. This would not be appropriate if the prior distribution
is intended to represent the actual information available to the analyst before seeing
the data. Phillips (1991b, pp. 468-69) argued that, in order to be truly uninformative,
a prior distribution in this context would have this property, since the larger
the true value of $\rho$, the more rapidly information about $\rho$ contained in the sample
$\{y_1, y_2, \ldots, y_T\}$ is going to accumulate with the sample size $T$. Certainly the
concept of what it means for a prior distribution to be "uninformative" can be
difficult and controversial.¹⁵

The potential difficulty in persuading others of the validity of one's prior
beliefs has always been the key weakness of Bayesian statistics, and it seems
unavoidable here. The best a Bayesian can do may be to take an explicit stand on
the nature and strength of prior information and defend it as best as possible. If
the nature of the prior information is that all values of $\rho$ are equally likely, then
it is satisfactory to use the standard OLS $t$ and $F$ tests in the usual way. If one is
unwilling to take such a stand, then Sims and Uhlig urged that investigators report
both the classical hypothesis test of $\rho = 1$ and the classical confidence region
around $\hat{\rho}$ and let the reader interpret the results as he or she sees fit.


APPENDIX 17.A. Proofs of Chapter 17 Propositions 
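■ Proof of Proposition 17.2. Observe that

$$
\begin{aligned}
\sum_{s=1}^{t} u_s &= \sum_{s=1}^{t} \sum_{j=0}^{\infty} \psi_j \varepsilon_{s-j} \\
&= \{\psi_0 \varepsilon_t + \psi_1 \varepsilon_{t-1} + \psi_2 \varepsilon_{t-2} + \cdots + \psi_t \varepsilon_0 + \psi_{t+1} \varepsilon_{-1} + \cdots\} \\
&\quad + \{\psi_0 \varepsilon_{t-1} + \psi_1 \varepsilon_{t-2} + \psi_2 \varepsilon_{t-3} + \cdots + \psi_{t-1} \varepsilon_0 + \psi_t \varepsilon_{-1} + \cdots\} \\
&\quad + \{\psi_0 \varepsilon_{t-2} + \psi_1 \varepsilon_{t-3} + \psi_2 \varepsilon_{t-4} + \cdots + \psi_{t-2} \varepsilon_0 + \psi_{t-1} \varepsilon_{-1} + \cdots\} \\
&\quad + \cdots + \{\psi_0 \varepsilon_1 + \psi_1 \varepsilon_0 + \psi_2 \varepsilon_{-1} + \cdots\} \\
&= \psi_0 \varepsilon_t + (\psi_0 + \psi_1)\varepsilon_{t-1} + (\psi_0 + \psi_1 + \psi_2)\varepsilon_{t-2} + \cdots \\
&\quad + (\psi_0 + \psi_1 + \cdots + \psi_{t-1})\varepsilon_1 + (\psi_1 + \psi_2 + \cdots + \psi_t)\varepsilon_0 \\
&\quad + (\psi_2 + \psi_3 + \cdots + \psi_{t+1})\varepsilon_{-1} + \cdots \\
&= (\psi_0 + \psi_1 + \psi_2 + \cdots)\varepsilon_t - (\psi_1 + \psi_2 + \psi_3 + \cdots)\varepsilon_t \\
&\quad + (\psi_0 + \psi_1 + \psi_2 + \cdots)\varepsilon_{t-1} - (\psi_2 + \psi_3 + \psi_4 + \cdots)\varepsilon_{t-1} \\
&\quad + (\psi_0 + \psi_1 + \psi_2 + \cdots)\varepsilon_{t-2} - (\psi_3 + \psi_4 + \psi_5 + \cdots)\varepsilon_{t-2} + \cdots \\
&\quad + (\psi_0 + \psi_1 + \psi_2 + \cdots)\varepsilon_1 - (\psi_t + \psi_{t+1} + \psi_{t+2} + \cdots)\varepsilon_1 \\
&\quad + (\psi_1 + \psi_2 + \psi_3 + \cdots)\varepsilon_0 - (\psi_{t+1} + \psi_{t+2} + \psi_{t+3} + \cdots)\varepsilon_0 \\
&\quad + (\psi_2 + \psi_3 + \psi_4 + \cdots)\varepsilon_{-1} - (\psi_{t+2} + \psi_{t+3} + \psi_{t+4} + \cdots)\varepsilon_{-1} + \cdots
\end{aligned}
$$

or

$$\sum_{s=1}^{t} u_s = \psi(1) \cdot \sum_{s=1}^{t} \varepsilon_s + \eta_t - \eta_0, \qquad [17.A.1]$$

where

$$
\begin{aligned}
\eta_t &= -(\psi_1 + \psi_2 + \psi_3 + \cdots)\varepsilon_t - (\psi_2 + \psi_3 + \psi_4 + \cdots)\varepsilon_{t-1} \\
&\quad - (\psi_3 + \psi_4 + \psi_5 + \cdots)\varepsilon_{t-2} - \cdots \\
\eta_0 &= -(\psi_1 + \psi_2 + \psi_3 + \cdots)\varepsilon_0 - (\psi_2 + \psi_3 + \psi_4 + \cdots)\varepsilon_{-1} \\
&\quad - (\psi_3 + \psi_4 + \psi_5 + \cdots)\varepsilon_{-2} - \cdots.
\end{aligned}
$$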


¹⁵ See the many comments accompanying Phillips (1991a).




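Notice that $\eta_t = \sum_{j=0}^{\infty} \alpha_j \varepsilon_{t-j}$, where $\alpha_j = -(\psi_{j+1} + \psi_{j+2} + \cdots)$, with $\{\alpha_j\}_{j=0}^{\infty}$ absolutely
summable:

$$
\begin{aligned}
\sum_{j=0}^{\infty} |\alpha_j| &= |\psi_1 + \psi_2 + \psi_3 + \cdots| + |\psi_2 + \psi_3 + \psi_4 + \cdots| + |\psi_3 + \psi_4 + \psi_5 + \cdots| + \cdots \\
&\leq \{|\psi_1| + |\psi_2| + |\psi_3| + \cdots\} + \{|\psi_2| + |\psi_3| + |\psi_4| + \cdots\} \\
&\quad + \{|\psi_3| + |\psi_4| + |\psi_5| + \cdots\} + \cdots \\
&= |\psi_1| + 2|\psi_2| + 3|\psi_3| + \cdots \\
&= \sum_{j=0}^{\infty} j \cdot |\psi_j|,
\end{aligned}
$$

which is bounded by the assumptions in Proposition 17.2. ■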
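The decomposition [17.A.1] can be checked numerically when $u_t$ is a finite-order moving average, since the infinite sums then terminate. The sketch below is an illustration under that assumption; the function name `bn_max_gap`, the example coefficients, and the seed are hypothetical choices made here. It simulates $\varepsilon_t$, builds $u_t$ and $\eta_t$ from the definitions above, and reports the largest discrepancy between the two sides of [17.A.1].

```python
import random


def bn_max_gap(psi, T, seed=0):
    """Largest |sum_{s<=t} u_s - (psi(1) * sum_{s<=t} eps_s + eta_t - eta_0)| over t = 1..T.

    psi = [psi_0, ..., psi_q] are the coefficients of a finite MA(q) process.
    """
    q = len(psi) - 1
    rng = random.Random(seed)
    # draws[k] holds eps_{k - q}, so that times -q, ..., T are all available.
    draws = [rng.gauss(0.0, 1.0) for _ in range(T + q + 1)]

    def eps(t):
        return draws[t + q]

    # alpha_j = -(psi_{j+1} + ... + psi_q); alpha_q is an empty sum, i.e. zero.
    alpha = [-sum(psi[j + 1:]) for j in range(q + 1)]

    def u(t):
        return sum(psi[j] * eps(t - j) for j in range(q + 1))

    def eta(t):
        return sum(alpha[j] * eps(t - j) for j in range(q + 1))

    psi_one = sum(psi)  # psi(1), the long-run multiplier
    cum_u = cum_eps = max_gap = 0.0
    for t in range(1, T + 1):
        cum_u += u(t)
        cum_eps += eps(t)
        rhs = psi_one * cum_eps + eta(t) - eta(0)
        max_gap = max(max_gap, abs(cum_u - rhs))
    return max_gap


# Example MA(3) coefficients (hypothetical); the gap should be at rounding-error level.
gap = bn_max_gap([1.0, 0.5, -0.3, 0.2], T=200)
print(gap)
```

A gap at the level of floating-point rounding error is what the identity predicts for every $t$.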


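■ Proof of Proposition 17.3.

(a) This was shown in [17.5.9].

(b) This follows from [7.2.17] and the fact that $E(u_t^2) = \gamma_0$.

(c) This is implied by [7.2.14].

(d) Since $\xi_t = \sum_{s=1}^{t} u_s$, Proposition 17.2 asserts that

$$\xi_t = \psi(1) \sum_{s=1}^{t} \varepsilon_s + \eta_t - \eta_0. \qquad [17.A.2]$$

Hence,

$$
\begin{aligned}
T^{-1} \sum_{t=1}^{T} \xi_{t-1}\varepsilon_t &= T^{-1} \sum_{t=1}^{T} \{\psi(1)(\varepsilon_1 + \varepsilon_2 + \cdots + \varepsilon_{t-1}) + \eta_{t-1} - \eta_0\}\varepsilon_t \\
&= \psi(1) \cdot T^{-1} \sum_{t=1}^{T} (\varepsilon_1 + \varepsilon_2 + \cdots + \varepsilon_{t-1})\varepsilon_t \\
&\quad + T^{-1} \sum_{t=1}^{T} (\eta_{t-1} - \eta_0)\varepsilon_t.
\end{aligned}
\qquad [17.A.3]
$$

But [17.3.26] established that

$$T^{-1} \sum_{t=1}^{T} (\varepsilon_1 + \varepsilon_2 + \cdots + \varepsilon_{t-1})\varepsilon_t \xrightarrow{L} (1/2)\sigma^2 \cdot \{[W(1)]^2 - 1\}. \qquad [17.A.4]$$

Furthermore, Proposition 17.2 ensures that $\{(\eta_{t-1} - \eta_0)\varepsilon_t\}_{t=1}^{\infty}$ is a martingale difference
sequence with finite variance, and so, from Example 7.11,

$$T^{-1} \sum_{t=1}^{T} (\eta_{t-1} - \eta_0)\varepsilon_t \xrightarrow{p} 0. \qquad [17.A.5]$$

Substituting [17.A.4] and [17.A.5] into [17.A.3] yields

$$T^{-1} \sum_{t=1}^{T} \xi_{t-1}\varepsilon_t \xrightarrow{L} (1/2)\sigma^2 \cdot \psi(1) \cdot \{[W(1)]^2 - 1\}, \qquad [17.A.6]$$

as claimed in (d).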
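(e) For $j = 0$ we have from [17.1.11] that

$$T^{-1} \sum_{t=1}^{T} \xi_{t-1}u_t = (1/2)T^{-1}\xi_T^2 - (1/2)T^{-1}(u_1^2 + u_2^2 + \cdots + u_T^2). \qquad [17.A.7]$$

But

$$T^{-1}\xi_T^2 = [T^{-1/2}(u_1 + u_2 + \cdots + u_T)]^2 \xrightarrow{L} \lambda^2 \cdot [W(1)]^2, \qquad [17.A.8]$$

from result (a). Also,

$$T^{-1}(u_1^2 + u_2^2 + \cdots + u_T^2) \xrightarrow{p} \gamma_0,$$

from result (c). Thus, [17.A.7] converges to

$$T^{-1} \sum_{t=1}^{T} \xi_{t-1}u_t \xrightarrow{L} (1/2)\{\lambda^2 \cdot [W(1)]^2 - \gamma_0\}, \qquad [17.A.9]$$

which establishes result (e) for $j = 0$.

For $j > 0$, observe that

$$\xi_{t-1} = \xi_{t-j-1} + u_{t-j} + u_{t-j+1} + \cdots + u_{t-1},$$

implying that

$$
\begin{aligned}
T^{-1} \sum_{t=j+1}^{T} \xi_{t-1}u_{t-j} &= T^{-1} \sum_{t=j+1}^{T} (\xi_{t-j-1} + u_{t-j} + u_{t-j+1} + \cdots + u_{t-1})u_{t-j} \\
&= T^{-1} \sum_{t=j+1}^{T} \xi_{t-j-1}u_{t-j} \\
&\quad + T^{-1} \sum_{t=j+1}^{T} (u_{t-j} + u_{t-j+1} + \cdots + u_{t-1})u_{t-j}.
\end{aligned}
\qquad [17.A.10]
$$

But

$$T^{-1} \sum_{t=j+1}^{T} \xi_{t-j-1}u_{t-j} = [(T - j)/T] \cdot (T - j)^{-1} \sum_{t=1}^{T-j} \xi_{t-1}u_t \xrightarrow{L} (1/2)\{\lambda^2 \cdot [W(1)]^2 - \gamma_0\},$$

as in [17.A.9]. Also,

$$T^{-1} \sum_{t=j+1}^{T} (u_{t-j} + u_{t-j+1} + \cdots + u_{t-1})u_{t-j} \xrightarrow{p} \gamma_0 + \gamma_1 + \cdots + \gamma_{j-1},$$

from result (c). Thus, [17.A.10] converges to

$$T^{-1} \sum_{t=j+1}^{T} \xi_{t-1}u_{t-j} \xrightarrow{L} (1/2)\{\lambda^2 \cdot [W(1)]^2 - \gamma_0\} + \{\gamma_0 + \gamma_1 + \cdots + \gamma_{j-1}\}.$$

Clearly, $T^{-1} \sum_{t=1}^{T} \xi_{t-1}u_{t-j}$ has the same asymptotic distribution, since

$$T^{-1} \sum_{t=1}^{j} \xi_{t-1}u_{t-j} \xrightarrow{p} 0.$$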


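(f) From the definition of $\xi_t$ in [17.5.11] and $X_T(r)$ in [17.5.4], it follows as in [17.3.15]
that

$$\int_0^1 \sqrt{T} \cdot X_T(r)\,dr = T^{-3/2} \sum_{t=1}^{T} \xi_{t-1}.$$

Result (f) then follows immediately from [17.5.5].

(g) First notice that

$$T^{-3/2} \sum_{t=1}^{T} t u_{t-j} = T^{-3/2} \sum_{t=1}^{T} (t - j + j)u_{t-j},$$

where $j \cdot T^{-3/2} \sum_{t=1}^{T} u_{t-j} \xrightarrow{p} 0$. Hence,

$$T^{-3/2} \sum_{t=1}^{T} t u_{t-j} \xrightarrow{p} T^{-3/2} \sum_{t=1}^{T} (t - j)u_{t-j} \xrightarrow{p} T^{-3/2} \sum_{t=1}^{T} t u_t.$$

But from [17.3.19],

$$T^{-3/2} \sum_{t=1}^{T} t u_t = T^{-1/2} \sum_{t=1}^{T} u_t - T^{-3/2} \sum_{t=1}^{T} \xi_{t-1} \xrightarrow{L} \lambda \cdot W(1) - \lambda \cdot \int_0^1 W(r)\,dr,$$

by virtue of (a) and (f).

(h) Using the same analysis as in [17.3.20] through [17.3.22], for $\xi_t$ defined in [17.5.11]
and $X_T(r)$ defined in [17.5.4], we have

$$T^{-1}\{\xi_0^2/T + \xi_1^2/T + \cdots + \xi_{T-1}^2/T\} = \int_0^1 [\sqrt{T} \cdot X_T(r)]^2\,dr \xrightarrow{L} [\sigma \cdot \psi(1)]^2 \cdot \int_0^1 [W(r)]^2\,dr,$$

by virtue of [17.5.5].

(i) As in [17.3.23],

$$
\begin{aligned}
T^{-5/2} \sum_{t=1}^{T} t\xi_{t-1} &= T^{1/2} \sum_{t=1}^{T} (t/T) \cdot (\xi_{t-1}/T^2) \\
&= T^{1/2} \int_0^1 \{([Tr]^* + 1)/T\} \cdot \{(u_1 + u_2 + \cdots + u_{[Tr]^*})/T\}\,dr \\
&= \int_0^1 \{([Tr]^* + 1)/T\} \cdot \sqrt{T} \cdot X_T(r)\,dr \\
&\xrightarrow{L} \sigma \cdot \psi(1) \cdot \int_0^1 rW(r)\,dr,
\end{aligned}
$$

from [17.5.5] and the continuous mapping theorem.

(j) From the same argument as in (i),

$$
\begin{aligned}
T^{-3} \sum_{t=1}^{T} t\xi_{t-1}^2 &= T \sum_{t=1}^{T} (t/T) \cdot (\xi_{t-1}^2/T^3) \\
&= T \int_0^1 \{([Tr]^* + 1)/T\} \cdot \{(u_1 + u_2 + \cdots + u_{[Tr]^*})/T\}^2\,dr \\
&= \int_0^1 \{([Tr]^* + 1)/T\} \cdot [\sqrt{T} \cdot X_T(r)]^2\,dr \\
&\xrightarrow{L} [\sigma \cdot \psi(1)]^2 \cdot \int_0^1 r[W(r)]^2\,dr.
\end{aligned}
$$

(k) This is identical to result (h) from Proposition 17.1, repeated in this proposition
for the reader's convenience. ■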


Chapter 17 Exercises 


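17.1. Let $\{u_t\}$ be an i.i.d. sequence with mean zero and variance $\sigma^2$, and let $y_t = u_1 +
u_2 + \cdots + u_t$ with $y_0 = 0$. Deduce from [17.3.17] and [17.3.18] that

$$
\begin{bmatrix} T^{-1/2} \sum u_t \\ T^{-3/2} \sum y_{t-1} \end{bmatrix}
\xrightarrow{L} N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix},\ \sigma^2 \begin{bmatrix} 1 & 1/2 \\ 1/2 & 1/3 \end{bmatrix} \right),
$$

where $\sum$ indicates summation over $t$ from 1 to $T$. Comparing this result with Proposition
17.1, argue that

$$
\begin{bmatrix} W(1) \\ \int W(r)\,dr \end{bmatrix}
\sim N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix},\ \begin{bmatrix} 1 & 1/2 \\ 1/2 & 1/3 \end{bmatrix} \right),
$$

where the integral sign denotes integration over $r$ from 0 to 1.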


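17.2. Phillips (1987) generalization of case 1. Suppose that data are generated from the
process $y_t = y_{t-1} + u_t$, where $u_t = \psi(L)\varepsilon_t$, $\sum_{j=0}^{\infty} j \cdot |\psi_j| < \infty$, and $\varepsilon_t$ is i.i.d. with mean zero,
variance $\sigma^2$, and finite fourth moment. Consider OLS estimation of the autoregression $y_t
= \rho y_{t-1} + u_t$. Let $\hat{\rho}_T = (\sum y_{t-1}^2)^{-1}(\sum y_{t-1}y_t)$ be the OLS estimate of $\rho$, $s_T^2 = (T - 1)^{-1} \times
\sum \hat{u}_t^2$ the OLS estimate of the variance of the regression error, $\hat{\sigma}_{\hat{\rho}_T}^2 = s_T^2 \cdot (\sum y_{t-1}^2)^{-1}$ the OLS
estimate of the variance of $\hat{\rho}_T$, and $t_T = (\hat{\rho}_T - 1)/\hat{\sigma}_{\hat{\rho}_T}$ the OLS $t$ test of $\rho = 1$, and define
$\lambda \equiv \sigma \cdot \psi(1)$. Use Proposition 17.3 to show that

$$\text{(a)}\quad T(\hat{\rho}_T - 1) \xrightarrow{L} \frac{\tfrac{1}{2}\{\lambda^2 \cdot [W(1)]^2 - \gamma_0\}}{\lambda^2 \cdot \int [W(r)]^2\,dr};$$

$$\text{(b)}\quad T^2 \cdot \hat{\sigma}_{\hat{\rho}_T}^2 \xrightarrow{L} \frac{\gamma_0}{\lambda^2 \cdot \int [W(r)]^2\,dr};$$

$$\text{(c)}\quad t_T \xrightarrow{L} (\lambda^2/\gamma_0)^{1/2}\, \frac{\tfrac{1}{2}\{[W(1)]^2 - \gamma_0/\lambda^2\}}{\left\{\int [W(r)]^2\,dr\right\}^{1/2}};$$

$$\text{(d)}\quad T(\hat{\rho}_T - 1) - \tfrac{1}{2}(T^2 \cdot \hat{\sigma}_{\hat{\rho}_T}^2 \div s_T^2)(\lambda^2 - \gamma_0) \xrightarrow{L} \frac{\tfrac{1}{2}\{[W(1)]^2 - 1\}}{\int [W(r)]^2\,dr};$$

$$\text{(e)}\quad (\gamma_0/\lambda^2)^{1/2} \cdot t_T - \{\tfrac{1}{2}(\lambda^2 - \gamma_0)/\lambda\} \times \{T \cdot \hat{\sigma}_{\hat{\rho}_T} \div s_T\} \xrightarrow{L} \frac{\tfrac{1}{2}\{[W(1)]^2 - 1\}}{\left\{\int [W(r)]^2\,dr\right\}^{1/2}}.$$

Suggest estimates of $\gamma_0$ and $\lambda^2$ that could be used to construct the statistics in (d) and (e),
and indicate where one could find critical values for these statistics.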


17.3. Phillips and Perron (1988) generalization of case 4. epee that data are generated 
from the process y, = a + y,., + u,, where u, = W(L)e,, Zio f*|yy| < , and ¢, is i.i.d, 

with mean zero, variance o*, and finite fourth moment, and where a can be any value, 

including zero. Consider OLS estimation of 

y, = at py,_, + Of + u, 
As in [17.4.49], note that the fitted values and estimate of p from this regression are identical 
to those from an OLS regression of y, on a constant, time trend, and ¢,_, = y,_, — a(f — 1): 
y, = a + p*é,_, + Ft + yu, 


where, under the assumed data-generating process, é, satisfies the assumptions of Proposition 
17.3. Let (&}, At, 57)’ be the OLS estimates given by equation [17.4.50}, $7 = (T-3)-'x 
24? the OLS estimate of the variance of the ple error, 63; the OLS estimate 1 
the variance of f7 given, in [17.4. 34], and t}. = — 1)/6,; the OLS ¢ test of p = 
Recall further that br, GF, and f7 are Muncie identical to the analogous ara 
for the original regression, 6,, 63,, and f,. Finally, define A = o-y(1). Use Proposition 17.3 
to show that 
1 T-*73é,_, T-*3t 
(a) | TO 2E,_, TBE, To SPDE,_ at 
T "Xt = TS Eré,_, T-3E0? 


1 | W(r) dr 1/2 


1 0 
|| [wo dr [ wor dr [ wo drij Oa 
0 0 


V2 { rW(r) dr 1/3 


a) 
mm O 


T-"3u, 10 0 W(1) 
(b) | T-'Se,_w,| Salo a oll aware - bv/aent: 
T7323 tu, 00 1 W(1) — | Wr) dr 




1 [ we dr 1/2 

10 0 

©) | T@E-1) | >]0 10 [wo dr [ wore [ wo ar 
Pe eee LO Oe v2 [ W()dr 18 

w(t) 


x | HWP — [w/a fs 


{way = [wo ar} 


1 { W(r) dr 1/2 
0 
(763,570 1 af wea fowora [mma [1 
0 
1/2 [ mm dr 1/3 
= (s}/A2)- Q; 
(€) tr (A/y)"?- Tbr — WVO; 
(0) T(br ~ 1) ~ 4(7?-63, + 52)? ~ ») 


1 i W(r) dr 12 
4[0 1 0 | W(r) ar { [W(r)P dr if rW(r) dr 
mee { rW(r)dr 03 
W(1) 


x} HWP - 
Ww(l) - [ww dr 
= VY; 
(8) (o/A2) ot, — (4A? — VA} x {T-6, + Sp} SV + VEO. 

Suggest estimates of ‘y, and A? that could be used to construct the statistics in (f) and (g), 
and indicate where one could find critical values for these statistics. 
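17.4. Generalization of case 1 for autoregression. Consider OLS estimation of

$$y_t = \zeta_1 \Delta y_{t-1} + \zeta_2 \Delta y_{t-2} + \cdots + \zeta_{p-1} \Delta y_{t-p+1} + \rho y_{t-1} + \varepsilon_t,$$

where $\varepsilon_t$ is i.i.d. with mean zero, variance $\sigma^2$, and finite fourth moment and the roots
of $(1 - \zeta_1 z - \zeta_2 z^2 - \cdots - \zeta_{p-1} z^{p-1}) = 0$ are outside the unit circle. Define $\lambda \equiv
\sigma/(1 - \zeta_1 - \zeta_2 - \cdots - \zeta_{p-1})$ and $\gamma_j \equiv E\{(\Delta y_t)(\Delta y_{t-j})\}$. Let $\hat{\zeta}_T \equiv (\hat{\zeta}_{1,T}, \hat{\zeta}_{2,T}, \ldots, \hat{\zeta}_{p-1,T})'$
be the $(p - 1) \times 1$ vector of estimated OLS coefficients on the lagged changes in $y$, and
let $\zeta$ be the corresponding true value. Show that if the true value of $\rho$ is unity, then

$$
\begin{bmatrix} T^{1/2}(\hat{\zeta}_T - \zeta) \\ T(\hat{\rho}_T - 1) \end{bmatrix}
\xrightarrow{L}
\begin{bmatrix} V & 0 \\ 0' & \lambda^2 \cdot \int [W(r)]^2\,dr \end{bmatrix}^{-1}
\begin{bmatrix} h_1 \\ \tfrac{1}{2}\sigma\lambda\{[W(1)]^2 - 1\} \end{bmatrix},
$$

where $V$ is the $[(p - 1) \times (p - 1)]$ matrix defined in [17.7.19] and $h_1 \sim N(0, \sigma^2 V)$. Deduce
from this that

$$\text{(a)}\quad T^{1/2}(\hat{\zeta}_T - \zeta) \xrightarrow{L} N(0, \sigma^2 V^{-1});$$

$$\text{(b)}\quad T(\hat{\rho}_T - 1)/(1 - \hat{\zeta}_{1,T} - \hat{\zeta}_{2,T} - \cdots - \hat{\zeta}_{p-1,T}) \xrightarrow{L} \frac{\tfrac{1}{2}\{[W(1)]^2 - 1\}}{\int [W(r)]^2\,dr};$$

$$\text{(c)}\quad (\hat{\rho}_T - 1)/\hat{\sigma}_{\hat{\rho}_T} \xrightarrow{L} \frac{\tfrac{1}{2}\{[W(1)]^2 - 1\}}{\left\{\int [W(r)]^2\,dr\right\}^{1/2}}.$$

Where could you find critical values for the statistics in (b) and (c)?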
17.5, Generalization of case 3 for autoregression. Consider OLS estimation of 
y = GAy,, + f2Ay,ng to + 6-1 AYe— pit Fa + pyri + & 


where ¢, is i.i.d. with mean zero, vatiance o?, and finite fourth moment and the roots of 
(1 — gz — {,27 — ++» — {,_,z°7!) = 0 are outside the unit circle. 


(a) Show that the fitted values for this regression are identical to those for the following 
transformed specification: 


VM = Gujer + Saya to + Gath par FM + OH + &, 
where u, = Ay, — wand uw = a/(1 — g; — f2 — ++ — Z,-4). 
(b) Suppose that the true value of p is 1 and the true value of @ is nonzero. Show 
that under these assumptions, 


u, = (i -OL- 0,.L7-—-:: - op: iL?~')le, 
rot u(t = 1) + g,-15 


where 
$1 = Yo $y + uy toes + wy. 


Conclude that for fixed yy, the variables u, and ¢, satisfy the assumptions of Proposition 17.3 
and that y, is dominated asymptotically by a time trend. 


(c) Let y, = E(uy,.,), and let £, = (hn dan .--, dur)’ be the (p — 1) x 1 
vector of estimated OLS coefficients on (u,_,, u)-2, --- 5 U,-pa1); these, of course, are 
identical to the coefficients on (Ay,_,, Ay,_2,. . . , Ay,-p41) in the original regression. Show 


that if p = l anda # 0, 


T(E, — £) Vo. 07th, 
Ti, — w)]-> {0 1 w2] Iho}, 
T**(6, — 1) 0 w/2 w7/3) [Ay 


where 
h, 0 vo 0 
hy] ~ N[ OJ, o7/0° 1 pl2 
hy 0. 0 pw2 pf 


and V is the matrix in (17.7.19]. Conclude as in the analysis of Section 16.3 that any OLS 
t or F test on the original regression can be compared with the standard ¢ and F tables to 
give an asymptotically valid inference. 


17.6. Generalization of case 4 for autoregression. Consider OLS estimation of 
y = g,Ay,-, + G2Ay,_2 ap ee oe $,-14Y,—-pat tat PY,-1 + ot + En, 

where «, is i.i.d. with mean zero, variance o?, and finite fourth moment and then roots of 
(1 — iz — 27 — «++ — f,_,z"71) = 0 are outside the unit circle. 

(a) Show that the fitted vatues of this regression are numerically identical to those of 
the following erie 

= fue, + Saty-2 Fo + S-type + we + p&1 + Ot + &,, 

where u, ie ea ~4-&- op~i)s B* = (1 — p)u, -1 = V1 


uF 1), and 6* = 6 + py. Note that the Stand feottnciente i, and 6, and their standard 
errors will be identical for the two regressions. 


(b) Suppose that the true value of p is 1 and the true value of 6 is 0. Show that under 
these assumptions, 


= (WA - Gb - 6b? -  — Glee, 
G1 = Yo tay tu, to + hy. 




Conclude that for fixed y,, the variables u, and é, satisfy the assumptions of Proposition 
17,3. 


(c) Again let p = 1 and 6 = 0, and define y, = E(u,u,_,) and 


Azol(l—%-&- + - G1). 
Show that at 

Vv 0 0 0 

mE, — 8) ; 

Tat ie 0 1 a-| W(r) dr 1/2 

T(é, — 1) a - +. 2 : 

ress _ 8*) oA [wo dr xX [wor dr 2h [ we dr 
0’ 1/2 af rW(r) dr 13 

hy 
oa: W(1) 


loa(W(D)P - 1 
o-{wa) - [ww ar} 


where h, ~ N(0, o7V) and V is as defined in [17.7.19]. 
(d) Deduce from answer (c) that 
TG, — 0) 3 NO, oV-'); 


T(ér — I - bir or br ee eas f,-.1) 
1 i W(r) dr 2]. W(1) 
+0 1 0] W(r) dr | (W(n)F dr i rW(r) dr u(Ww(L)F — Y 
12 [ wo dr 1/3 W(1) — | W(r) dr 
= V; 
(6; — 1/6, > V + VO, 
where 
1 { W(r) dr 1/2 i: 
0 
Q=(0 1 Qj i W(r) dr i (W(r)P dr { rW(r) dr ; 
0 
V/2 [ mo dr 1/3 


Notice that the distribution of V is the same as the asymptotic distribution of the variable 
tabulated for case 4 in Table B.5, while the distribution of V/V is the same as the asymptotic 
distribution of the variable tabulated for case 4 in Table B.6. 


Chapter 17 References 


Andrews, Donald W. K. 1991. “Heteroskedasticity and Autocorrelation Consistent Covariance
Matrix Estimation.” Econometrica 59:817-58.

Andrews, Donald W. K. 1993. “Exactly Median-Unbiased Estimation of First Order
Autoregressive/Unit Root Models.” Econometrica 61:139-65.

Beveridge, Stephen, and Charles R. Nelson. 1981. “A New Approach to Decomposition
of Economic Time Series into Permanent and Transitory Components with Particular
Attention to Measurement of the ‘Business Cycle.’” Journal of Monetary Economics 7:151-74.




Bhargava, Alok. 1986. “On the Theory of Testing for Unit Roots in Observed Time Series.” 
Review of Economic Studies 53:369-84. 

Billingsley, Patrick. 1968. Convergence of Probability Measures. New York: Wiley. 
Campbell, John Y., and Pierre Perron. 1991. “Pitfalls and Opportunities: What Macroe- 
conomists Should Know about Unit Roots.” NBER Macroeconomics Annual. Cambridge, 
Mass.: MIT Press. 

Cecchetti, Stephen G., and Pok-sang Lam. 1991. “What Do We Learn from Variance Ratio
Statistics? A Study of Stationary and Nonstationary Models with Breaking Trends.”
Department of Economics, Ohio State University. Mimeo.

Chan, N. H., and C. Z. Wei. 1987. “Asymptotic Inference for Nearly Nonstationary AR(1)
Processes.” Annals of Statistics 15:1050-63.

Chan, N. H., and C. Z. Wei. 1988. “Limiting Distributions of Least Squares Estimates of
Unstable Autoregressive Processes.” Annals of Statistics 16:367-401.

Cochrane, John H. 1988. “How Big Is the Random Walk in GNP?” Journal of Political
Economy 96:893-920.

DeJong, David N., and Charles H. Whiteman. 1991. “Reconsidering ‘Trends and Random
Walks in Macroeconomic Time Series.’” Journal of Monetary Economics 28:221-54.

Dickey, David A., and Wayne A. Fuller. 1979. “Distribution of the Estimators for
Autoregressive Time Series with a Unit Root.” Journal of the American Statistical Association
74:427-31.

Dickey, David A., and Wayne A. Fuller. 1981. “Likelihood Ratio Statistics for Autoregressive
Time Series with a Unit Root.” Econometrica 49:1057-72.

Dickey, David A., and S. G. Pantula. 1987. “Determining the Order of Differencing in
Autoregressive Processes.” Journal of Business and Economic Statistics 5:455-61.

Evans, G. B. A., and N. E. Savin. 1981. “Testing for Unit Roots: 1.” Econometrica 49:753-79.

Evans, G. B. A., and N. E. Savin. 1984. “Testing for Unit Roots: 2.” Econometrica 52:1241-69.

Fuller, Wayne A. 1976. Introduction to Statistical Time Series. New York: Wiley. 

Hall, Alastair. 1989. “Testing for a Unit Root in the Presence of Moving Average Errors.”
Biometrika 76:49-56.

Hall, Alastair. 1991. “Testing for a Unit Root in Time Series with Pretest Data Based Model
Selection.” Department of Economics, North Carolina State University. Mimeo.

Hall, P., and C. C. Heyde. 1980. Martingale Limit Theory and Its Application. New York: 
Academic Press. 

Hansen, Bruce E. 1992. “Consistent Covariance Matrix Estimation for Dependent
Heterogeneous Processes.” Econometrica 60:967-72.

Jeffreys, H. 1946. “An Invariant Form for the Prior Probability in Estimation Problems.”
Proceedings of the Royal Society of London Series A, 186:453-61.

Kim, Kiwhan, and Peter Schmidt. 1990. “Some Evidence on the Accuracy of Phillips-Perron
Tests Using Alternative Estimates of Nuisance Parameters.” Economics Letters 34:345-50.

Kwiatkowski, Denis, Peter C. B. Phillips, Peter Schmidt, and Yongcheol Shin. 1992. “Testing
the Null Hypothesis of Stationarity against the Alternative of a Unit Root: How Sure
Are We That Economic Time Series Have a Unit Root?” Journal of Econometrics 54:159-78.

Lo, Andrew W., and A. Craig MacKinlay. 1988. “Stock Prices Do Not Follow Random
Walks: Evidence from a Simple Specification Test.” Review of Financial Studies 1:41-66.

Lo, Andrew W., and A. Craig MacKinlay. 1989. “The Size and Power of the Variance Ratio
Test in Finite Samples: A Monte Carlo Investigation.” Journal of Econometrics 40:203-38.

Malliaris, A. G., and W. A. Brock. 1982. Stochastic Methods in Economics and Finance. 
Amsterdam: North-Holland. 

Pantula, Sastry G., and Alastair Hall. 1991. ‘‘Testing for Unit Roots in Autoregressive 
Moving Average Models: An Instrumental Variable Approach.” Journal of Econometrics 
48:325-53. 

Park, Joon Y., and B. Choi. 1988. “A New Approach to Testing for a Unit Root.” Cornell 
University. Mimeo. 

Park, Joon Y., and Peter C. B. Phillips. 1988. ‘Statistical Inference in Regressions with 
Integrated Processes: Part 1.’’ Econometric Theory 4:468-97. 


542 Chapter 17 | Univariate Processes with Unit Roots 


and . 1989. “Statistical Inference in Regressions with Integrated Processes: 
Part 2.” Econometric Theory 5:95—131. 


Phillips, P. C. B. 1986. “‘Understanding Spurious Regressions in Econometrics.” Journal of 
Econometrics 33:311-40. 


. 1987. “Time Series Regression with a Unit Root.” Econometrica 55:277-301. 


~—. 1988. “‘Regression Theory for Near-Integrated Time Series.” Econometrica 56:1021- 
43. 


, 1991a. “To Criticize the Critics: An Objective Bayesian Analysis of Stochastic 
Trends.” Journal of Applied Econometrics 6:333-64. 

. 1991b. “Bayesian Routes and Unit Roots: De Rebus Prioribus Semper Est Dis- 
putandum.” Journal of Applied Econometrics 6:435-73. 

and Pierre Perron. 1988. “Testing for a Unit Root in Time Series Regression,” 
Biometrika 75:335-46, 

and Victor Solo. 1992, ‘‘Asymptotics for Linear Processes.” Annals of Statistics 
20:971-1001, 

Said, Said E, 1991. ‘“‘Unit-Root Tests for Time-Series Data with a Linear Time Trend.” 
Journal of Econometrics 47:285-303. 

and David A. Dickey, 1984. “Testing for Unit Roots in Autoregressive—-Moving 
Average Models of Unknown Order.”’ Biometrika 71:599-607. 

and . 1985. “Hypothesis Testing in ARIMA(p, 1, g) Models.” Journal of the 
American Statistical Association 80:369-74. 

Sargan, J. D., and Alok Bhargava. 1983. ‘‘Testing Residuals from Least Squares Regression 
for Being Generated by the Gaussian Random Walk.” Econometrica 51:153-74. 

Schmidt, Peter, and Peter C. B. Phillips. 1992. “LM Tests for a Unit Root in the Presence 
of Deterministic Trends."’ Oxford Bulletin of Economics and Statistics 54:257-87, 

Schwert, G. William. 1989. “Tests for Unit Roots: A Monte Carlo Investigation.” Journal 
of Business and Economic Statistics 7:147-59. 

Sims, Christopher A. 1988. “Bayesian Skepticism on Unit Root Econometrics.” Journal of 
Economic Dynamics and Control 12:463-74. 

, James H. Stock, and Mark W. Watson. 1990, “Inference in Linear Time Series 
Models with Some Unit Roots.” Econometrica 58:113-44. 

and Harald Uhlig. 1991. ‘‘Understanding Unit Rooters: A Helicopter Tour.” Econ- 
ometrica 59:1591-99, 

Solo, V. 1984. “The Order of Differencing in ARIMA Models.” Journal of the American 
Statistical Association 79:916-21. 

Sowell, Fallaw. 1990. ‘‘The Fractional Unit Root Distribution.” Econometrica 58:495-505. 
Stinchcombe, Maxwell, and Halbert White. 1993. ““An Approach to Consistent Specification 
Testing Using Duality and Banach Limit Theory.” University of California, San Diego. 
Mimeo. 

Stock, James H. 1991, ‘Confidence Intervals for the Largest Autoregressive Root in U.S. 
Macroeconomic Time Series.” Journal of Monetary Economics 28:435—59. 

. 1993, “Unit Roots and Trend Breaks,” in Robert Engle and Daniel McFadden, 
eds., Handbook of Econometrics, Vol. 4. Amsterdam: North-Holland. 

White, J. S. 1958. “The Limiting Distribution of the Serial Correlation Coefficient in the 
Explosive Case.” Annals of Mathematical Statistics 29:1188-97. 


Chapter 17 References 543 


18 Unit Roots in Multivariate Time Series


The previous chapter investigated statistical inference for univariate processes containing unit roots. This chapter develops comparable results for vector processes. The first section develops a vector version of the functional central limit theorem. Section 18.2 uses these results to generalize the analysis of Section 17.7 to vector autoregressions. Section 18.3 discusses an important problem, known as spurious regression, that can arise if the error term in a regression is I(1). One should be concerned about the possibility of a spurious regression whenever all the variables in a regression are I(1) and no lags of the dependent variable are included in the regression.


18.1. Asymptotic Results for Nonstationary Vector Processes


Section 17.2 described univariate standard Brownian motion W(r) as a scalar continuous-time process ($W: r \in [0, 1] \to \mathbb{R}^1$). The variable W(r) has a N(0, r) distribution across realizations, and for any given realization, W(r) is a continuous function of the date r with independent increments. If a set of n such independent processes, denoted $W_1(r), W_2(r), \ldots, W_n(r)$, are collected in an (n × 1) vector W(r), the result is n-dimensional standard Brownian motion.


Definition: n-dimensional standard Brownian motion W(·) is a continuous-time process associating each date $r \in [0, 1]$ with the (n × 1) vector W(r) satisfying the following:

(a) W(0) = 0;
(b) For any dates $0 \le r_1 < r_2 < \cdots < r_k \le 1$, the changes $[W(r_2) - W(r_1)], [W(r_3) - W(r_2)], \ldots, [W(r_k) - W(r_{k-1})]$ are independent multivariate Gaussian with $[W(s) - W(r)] \sim N(0, (s - r) \cdot I_n)$;
(c) For any given realization, W(r) is continuous in r with probability 1.


Suppose that $\{v_t\}_{t=1}^{\infty}$ is a univariate i.i.d. discrete-time process with mean zero and unit variance, and let
$$X_T^*(r) \equiv T^{-1}(v_1 + v_2 + \cdots + v_{[Tr]^*}),$$
where $[Tr]^*$ denotes the largest integer that is less than or equal to $Tr$. The functional central limit theorem states that as $T \to \infty$,
$$\sqrt{T} \cdot X_T^*(\cdot) \xrightarrow{L} W(\cdot).$$
This readily generalizes. Suppose that $\{v_t\}_{t=1}^{\infty}$ is an n-dimensional i.i.d. vector process with $E(v_t) = 0$ and $E(v_t v_t') = I_n$, and let
$$X_T^*(r) \equiv T^{-1}(v_1 + v_2 + \cdots + v_{[Tr]^*}).$$
Then
$$\sqrt{T} \cdot X_T^*(\cdot) \xrightarrow{L} W(\cdot). \qquad [18.1.1]$$


Next, consider an i.i.d. n-dimensional process $\{\varepsilon_t\}_{t=1}^{\infty}$ with mean zero and variance-covariance matrix given by $\Omega$. Let P be any matrix such that
$$\Omega = PP'; \qquad [18.1.2]$$
for example, P might be the Cholesky factor of $\Omega$. We could think of $\varepsilon_t$ as having been generated from
$$\varepsilon_t = P v_t, \qquad [18.1.3]$$
for $v_t$ i.i.d. with mean zero and variance $I_n$. To see why, notice that [18.1.3] implies that $\varepsilon_t$ is i.i.d. with mean zero and variance given by
$$E(\varepsilon_t \varepsilon_t') = P \cdot E(v_t v_t') \cdot P' = P \cdot I_n \cdot P' = \Omega.$$
Let
$$X_T(r) \equiv T^{-1}(\varepsilon_1 + \varepsilon_2 + \cdots + \varepsilon_{[Tr]^*}) = P \cdot T^{-1}(v_1 + v_2 + \cdots + v_{[Tr]^*}) = P \cdot X_T^*(r).$$
It then follows from [18.1.1] and the continuous mapping theorem that
$$\sqrt{T} \cdot X_T(\cdot) \xrightarrow{L} P \cdot W(\cdot). \qquad [18.1.4]$$

For given r, the variable P·W(r) represents P times a $N(0, r \cdot I_n)$ vector and so has a $N(0, r \cdot PP') = N(0, r \cdot \Omega)$ distribution. The process P·W(r) is described as n-dimensional Brownian motion with variance matrix $\Omega$.
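The construction in [18.1.2] through [18.1.4] can be illustrated numerically. The following is a minimal sketch, not part of the original text: the matrix `omega`, the sample size, the date `r`, and the function name `scaled_partial_sum` are all illustrative assumptions, and Gaussian shocks are used only for convenience. It draws $v_t$ with identity variance, forms $\varepsilon_t = P v_t$ with P the Cholesky factor, and checks that the sample covariance of $\sqrt{T} \cdot X_T(r)$ is close to $r \cdot \Omega$.

```python
import numpy as np

def scaled_partial_sum(omega, T, r, rng):
    """Return sqrt(T) * X_T(r) for one simulated path with Var(eps_t) = omega."""
    n = omega.shape[0]
    P = np.linalg.cholesky(omega)      # lower triangular factor, omega = P P'
    v = rng.standard_normal((T, n))    # v_t i.i.d. with mean zero and variance I_n
    eps = v @ P.T                      # eps_t = P v_t, stored as rows
    k = int(np.floor(T * r))           # [Tr]*: largest integer <= T r
    return eps[:k].sum(axis=0) / np.sqrt(T)

rng = np.random.default_rng(0)
omega = np.array([[2.0, 0.6], [0.6, 1.0]])   # hypothetical variance matrix
r = 0.5
draws = np.array([scaled_partial_sum(omega, 200, r, rng) for _ in range(4000)])
sample_cov = np.cov(draws, rowvar=False)     # should be near r * omega
```

Any other factor P with $PP' = \Omega$ would serve equally well here; the Cholesky factor is simply the one mentioned in the text.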

The functional central limit theorem can also be applied to serially dependent vector processes using a generalization of Proposition 17.2.¹ Suppose that
$$u_t = \sum_{s=0}^{\infty} \Psi_s \varepsilon_{t-s}, \qquad [18.1.5]$$
where if $\psi_{ij}^{(s)}$ denotes the row i, column j element of $\Psi_s$,
$$\sum_{s=0}^{\infty} s \cdot |\psi_{ij}^{(s)}| < \infty$$
for each i, j = 1, 2, ..., n. Then algebra virtually identical to that in Proposition 17.2 can be used to show that
$$\sum_{s=1}^{t} u_s = \Psi(1) \cdot \sum_{s=1}^{t} \varepsilon_s + \eta_t - \eta_0, \qquad [18.1.6]$$
where $\Psi(1) \equiv (\Psi_0 + \Psi_1 + \Psi_2 + \cdots)$ and $\eta_t = \sum_{s=0}^{\infty} \alpha_s \varepsilon_{t-s}$ for $\alpha_s = -(\Psi_{s+1} + \Psi_{s+2} + \Psi_{s+3} + \cdots)$, and $\{\alpha_s\}_{s=0}^{\infty}$ is absolutely summable. Expression [18.1.6] provides a multivariate generalization of the Beveridge-Nelson decomposition.

¹This is the approach used by Phillips and Solo (1992).
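As a concrete check of [18.1.6], consider the special case of a vector MA(1), $u_t = \varepsilon_t + \Psi_1 \varepsilon_{t-1}$, for which $\Psi(1) = I_n + \Psi_1$, $\alpha_0 = -\Psi_1$, and $\alpha_s = 0$ for $s \ge 1$, so that $\eta_t = -\Psi_1 \varepsilon_t$. The sketch below is an illustration under these assumptions only; the coefficient matrix `psi1` and the dimensions are hypothetical. It verifies that the two sides of the decomposition agree to machine precision for every t.

```python
import numpy as np

rng = np.random.default_rng(1)
n, T = 3, 50
psi1 = 0.4 * rng.standard_normal((n, n))   # hypothetical MA(1) coefficient matrix
eps = rng.standard_normal((T + 1, n))      # row 0 plays the role of eps_0

# u_t = eps_t + Psi_1 eps_{t-1} for t = 1, ..., T (vectors stored as rows)
u = eps[1:] + eps[:-1] @ psi1.T

psi_at_1 = np.eye(n) + psi1                # Psi(1) = Psi_0 + Psi_1
eta = -(eps @ psi1.T)                      # eta_t = alpha_0 eps_t with alpha_0 = -Psi_1

lhs = np.cumsum(u, axis=0)                 # sum_{s=1}^{t} u_s
rhs = np.cumsum(eps[1:], axis=0) @ psi_at_1.T + eta[1:] - eta[0]
bn_gap = np.abs(lhs - rhs).max()           # zero up to rounding error
```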

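If $u_t$ satisfies [18.1.5] where $\varepsilon_t$ is i.i.d. with mean zero, variance given by $\Omega = PP'$, and finite fourth moments, then it is straightforward to generalize to vector processes the statements in Proposition 17.3 about univariate processes. For example, if we define
$$X_T(r) \equiv (1/T) \sum_{s=1}^{[Tr]^*} u_s, \qquad [18.1.7]$$
then it follows from [18.1.6] that
$$\sqrt{T} \cdot X_T(r) = T^{-1/2} \left( \Psi(1) \sum_{s=1}^{[Tr]^*} \varepsilon_s + \eta_{[Tr]^*} - \eta_0 \right).$$
As in Example 17.2, one can show that
$$\sup_{r \in [0,1]} T^{-1/2} \left| \eta_{i,[Tr]^*} - \eta_{i,0} \right| \xrightarrow{p} 0 \quad \text{for } i = 1, 2, \ldots, n.$$
It then follows from [18.1.4] that
$$\sqrt{T} \cdot X_T(\cdot) \xrightarrow{p} \Psi(1) \cdot P \cdot \sqrt{T} \cdot X_T^*(\cdot) \xrightarrow{L} \Psi(1) \cdot P \cdot W(\cdot), \qquad [18.1.8]$$
where $\Psi(1) \cdot P \cdot W(r)$ is distributed $N(0, r[\Psi(1)] \cdot \Omega \cdot [\Psi(1)]')$ across realizations. Furthermore, for $\xi_t \equiv u_1 + u_2 + \cdots + u_t$, we have as in [17.3.15] that
$$T^{-3/2} \sum_{t=1}^{T} \xi_{t-1} = \int_0^1 \sqrt{T} \cdot X_T(r)\, dr \xrightarrow{L} \Psi(1) \cdot P \cdot \int_0^1 W(r)\, dr, \qquad [18.1.9]$$
which generalizes result (f) of Proposition 17.3.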

Generalizing result (e) of Proposition 17.3 requires a little more care. Consider for illustration the simplest case, where $v_t$ is an i.i.d. (n × 1) vector with mean zero and $E(v_t v_t') = I_n$. Define
$$\xi_t^* \equiv \begin{cases} v_1 + v_2 + \cdots + v_t & \text{for } t = 1, 2, \ldots, T \\ 0 & \text{for } t = 0; \end{cases}$$
we use the symbols $v_t$ and $\xi_t^*$ here in place of $u_t$ and $\xi_t$ to emphasize that $v_t$ is i.i.d. with variance matrix given by $I_n$. For the scalar i.i.d. unit variance case ($n = 1$, $\lambda = \psi_0 = 1$), result (e) of Proposition 17.3 stated that
$$T^{-1} \sum_{t=1}^{T} \xi_{t-1}^* v_t \xrightarrow{L} \tfrac{1}{2}\{[W(1)]^2 - 1\}. \qquad [18.1.10]$$
The corresponding generalization for the i.i.d. unit variance vector case (n > 1) turns out to be
$$T^{-1} \sum_{t=1}^{T} \{\xi_{t-1}^* v_t' + v_t \xi_{t-1}^{*\prime}\} \xrightarrow{L} [W(1)] \cdot [W(1)]' - I_n; \qquad [18.1.11]$$
see result (d) of Proposition 18.1, to follow. Expression [18.1.11] generalizes the scalar result [18.1.10] to an (n × n) matrix. The row i, column i diagonal element of this matrix expression states that
$$T^{-1} \sum_{t=1}^{T} \{\xi_{i,t-1}^* v_{it} + v_{it} \xi_{i,t-1}^*\} \xrightarrow{L} [W_i(1)]^2 - 1, \qquad [18.1.12]$$
where $\xi_{it}^*$, $v_{it}$, and $W_i(r)$ denote the ith elements of the vectors $\xi_t^*$, $v_t$, and W(r), respectively. The row i, column j off-diagonal element of [18.1.11] asserts that
$$T^{-1} \sum_{t=1}^{T} \{\xi_{i,t-1}^* v_{jt} + v_{it} \xi_{j,t-1}^*\} \xrightarrow{L} [W_i(1)] \cdot [W_j(1)] \quad \text{for } i \ne j. \qquad [18.1.13]$$
Thus, the sum of the random variables $T^{-1} \sum_{t=1}^{T} \xi_{i,t-1}^* v_{jt}$ and $T^{-1} \sum_{t=1}^{T} v_{it} \xi_{j,t-1}^*$ converges in distribution to the product of two independent standard Normal variables.

It is sometimes convenient to describe the asymptotic distribution of $T^{-1} \sum_{t=1}^{T} \xi_{i,t-1}^* v_{jt}$ alone. It turns out that
$$T^{-1} \sum_{t=1}^{T} \xi_{i,t-1}^* v_{jt} \xrightarrow{L} \int_0^1 W_i(r)\, dW_j(r). \qquad [18.1.14]$$
This expression makes use of the differential of Brownian motion, denoted $dW_j(r)$. A formal definition of the differential $dW_j(r)$ and derivation of [18.1.14] are somewhat involved; see Phillips (1988) for details. For our purposes, we will simply regard the right side of [18.1.14] as a compact notation for indicating the limiting distribution of the sequence represented by the left side. In practice, this distribution is constructed by Monte Carlo generation of the statistic on the left side of [18.1.14] for suitably large T.

It is evident from [18.1.13] and [18.1.14] that
$$\int_0^1 W_i(r)\, dW_j(r) + \int_0^1 W_j(r)\, dW_i(r) = W_i(1) \cdot W_j(1) \quad \text{for } i \ne j,$$
whereas comparing [18.1.14] with [18.1.12] reveals that
$$\int_0^1 W_i(r)\, dW_i(r) = \tfrac{1}{2}\{[W_i(1)]^2 - 1\}. \qquad [18.1.15]$$
The expressions in [18.1.14] can be collected for i, j = 1, 2, ..., n in an (n × n) matrix:
$$T^{-1} \sum_{t=1}^{T} \xi_{t-1}^* v_t' \xrightarrow{L} \int_0^1 [W(r)]\,[dW(r)]'. \qquad [18.1.16]$$
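The Monte Carlo construction mentioned for [18.1.14] can be sketched as follows. This is an illustration only: the function name `cross_moment`, the sample size, and the number of replications are assumptions, and Gaussian shocks are used for convenience. The code computes the matrix statistic on the left side of [18.1.16] and checks the finite-sample identity behind [18.1.11], namely that the statistic plus its transpose equals $T^{-1}(\xi_T^* \xi_T^{*\prime} - \sum_t v_t v_t')$ exactly. It then collects draws of the row 1, column 2 element.

```python
import numpy as np

def cross_moment(v):
    """T^{-1} * sum_t xi*_{t-1} v_t' for a (T x n) array of shocks v."""
    T, n = v.shape
    xi = np.cumsum(v, axis=0)
    xi_lag = np.vstack([np.zeros((1, n)), xi[:-1]])   # xi*_0 = 0
    return xi_lag.T @ v / T

rng = np.random.default_rng(2)
T, n = 500, 2
v = rng.standard_normal((T, n))
M = cross_moment(v)
xi_T = v.sum(axis=0)

# Exact identity: M + M' = T^{-1} (xi_T xi_T' - sum_t v_t v_t')
identity_gap = np.abs((M + M.T) - (np.outer(xi_T, xi_T) - v.T @ v) / T).max()

# Monte Carlo draws approximating the distribution of int W_1 dW_2
draws = np.array([cross_moment(rng.standard_normal((T, n)))[0, 1]
                  for _ in range(2000)])
```

A histogram or the empirical quantiles of `draws` would then serve as the approximation to the limiting distribution on the right side of [18.1.14].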


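The following proposition summarizes the multivariate convergence results that will be used in this chapter.²

Proposition 18.1: Let $u_t$ be an (n × 1) vector with
$$u_t = \Psi(L)\varepsilon_t = \sum_{s=0}^{\infty} \Psi_s \varepsilon_{t-s},$$
where $\{s \cdot \Psi_s\}_{s=0}^{\infty}$ is absolutely summable, that is, $\sum_{s=0}^{\infty} s \cdot |\psi_{ij}^{(s)}| < \infty$ for each i, j = 1, 2, ..., n for $\psi_{ij}^{(s)}$ the row i, column j element of $\Psi_s$. Suppose that $\{\varepsilon_t\}$ is an i.i.d. sequence with mean zero, finite fourth moments, and $E(\varepsilon_t \varepsilon_t') = \Omega$ a positive definite matrix. Let $\Omega = PP'$ denote the Cholesky factorization of $\Omega$, and define
$$\sigma_{ij} \equiv E(\varepsilon_{it}\varepsilon_{jt}) = \text{row } i, \text{ column } j \text{ element of } \Omega$$
$$\underset{(n \times n)}{\Gamma_s} \equiv E(u_t u_{t-s}') = \sum_{v=0}^{\infty} \Psi_{s+v}\, \Omega\, \Psi_v' \quad \text{for } s = 0, 1, 2, \ldots$$
$$\underset{(n\nu \times 1)}{z_t} \equiv \begin{bmatrix} u_{t-1} \\ u_{t-2} \\ \vdots \\ u_{t-\nu} \end{bmatrix} \quad \text{for arbitrary } \nu \ge 1 \qquad [18.1.17]$$
$$\underset{(n\nu \times n\nu)}{V} \equiv E(z_t z_t') = \begin{bmatrix} \Gamma_0 & \Gamma_1 & \cdots & \Gamma_{\nu-1} \\ \Gamma_{-1} & \Gamma_0 & \cdots & \Gamma_{\nu-2} \\ \vdots & \vdots & \ddots & \vdots \\ \Gamma_{-\nu+1} & \Gamma_{-\nu+2} & \cdots & \Gamma_0 \end{bmatrix}$$
$$\underset{(n \times n)}{\Lambda} \equiv \Psi(1) \cdot P = (\Psi_0 + \Psi_1 + \Psi_2 + \cdots) \cdot P \qquad [18.1.18]$$
$$\underset{(n \times 1)}{\xi_t} \equiv u_1 + u_2 + \cdots + u_t \quad \text{for } t = 1, 2, \ldots, T \qquad [18.1.19]$$
with $\xi_0 = 0$. Then

(a) $T^{-1/2} \sum_{t=1}^{T} u_t \xrightarrow{L} \Lambda \cdot W(1)$;
(b) $T^{-1/2} \sum_{t=1}^{T} z_t \varepsilon_{it} \xrightarrow{L} N(0, \sigma_{ii} \cdot V)$ for i = 1, 2, ..., n;
(c) $T^{-1} \sum_{t=1}^{T} u_t u_{t-s}' \xrightarrow{p} \Gamma_s$ for s = 0, 1, 2, ...;
(d) $T^{-1} \sum_{t=1}^{T} (\xi_{t-1} u_{t-s}' + u_{t-s} \xi_{t-1}') \xrightarrow{L} \begin{cases} \Lambda \cdot [W(1)] \cdot [W(1)]' \cdot \Lambda' - \Gamma_0 & \text{for } s = 0 \\ \Lambda \cdot [W(1)] \cdot [W(1)]' \cdot \Lambda' + \sum_{v=-s+1}^{s-1} \Gamma_v & \text{for } s = 1, 2, \ldots; \end{cases}$
(e) $T^{-1} \sum_{t=1}^{T} \xi_{t-1} u_t' \xrightarrow{L} \Lambda \cdot \left\{ \int_0^1 [W(r)]\,[dW(r)]' \right\} \cdot \Lambda' + \sum_{v=1}^{\infty} \Gamma_v'$;
(f) $T^{-1} \sum_{t=1}^{T} \xi_{t-1} \varepsilon_t' \xrightarrow{L} \Lambda \cdot \left\{ \int_0^1 [W(r)]\,[dW(r)]' \right\} \cdot P'$;
(g) $T^{-3/2} \sum_{t=1}^{T} \xi_{t-1} \xrightarrow{L} \Lambda \cdot \int_0^1 W(r)\, dr$;
(h) $T^{-3/2} \sum_{t=1}^{T} t\, u_{t-s} \xrightarrow{L} \Lambda \cdot \left\{ W(1) - \int_0^1 W(r)\, dr \right\}$ for s = 0, 1, 2, ...;
(i) $T^{-2} \sum_{t=1}^{T} \xi_{t-1} \xi_{t-1}' \xrightarrow{L} \Lambda \cdot \left\{ \int_0^1 [W(r)] \cdot [W(r)]'\, dr \right\} \cdot \Lambda'$;
(j) $T^{-5/2} \sum_{t=1}^{T} t\, \xi_{t-1} \xrightarrow{L} \Lambda \cdot \int_0^1 r W(r)\, dr$;
(k) $T^{-3} \sum_{t=1}^{T} t\, \xi_{t-1} \xi_{t-1}' \xrightarrow{L} \Lambda \cdot \left\{ \int_0^1 r [W(r)] \cdot [W(r)]'\, dr \right\} \cdot \Lambda'$;
(l) $T^{-(\nu+1)} \sum_{t=1}^{T} t^{\nu} \to 1/(\nu + 1)$ for $\nu$ = 0, 1, 2, ....

²These or similar results were derived by Phillips and Durlauf (1986), Park and Phillips (1988, 1989), Sims, Stock, and Watson (1990), and Phillips and Solo (1992).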
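Result (l) is purely deterministic and easy to confirm numerically. The short sketch below is an illustration; the function name `scaled_power_sum` and the value of T are assumptions. It evaluates the scaled sum for a large T and compares it with $1/(\nu + 1)$.

```python
def scaled_power_sum(T, nu):
    """Compute T^{-(nu+1)} * sum_{t=1}^{T} t^nu."""
    return sum(t ** nu for t in range(1, T + 1)) / T ** (nu + 1)

# Compare with the limit 1/(nu + 1) for nu = 0, 1, 2 at a large sample size
gaps = [abs(scaled_power_sum(10_000, nu) - 1.0 / (nu + 1)) for nu in range(3)]
```

The approximation error is of order 1/T, which is why the limit is reached only slowly in T.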




18.2. Vector Autoregressions Containing Unit Roots 


Suppose that a vector $y_t$ could be described by a vector autoregression in the differences $\Delta y_t$. This section presents results developed by Park and Phillips (1988, 1989) and Sims, Stock, and Watson (1990) for the consequences of estimating the VAR in levels. We begin by generalizing the Dickey-Fuller variable transformation that was used in analyzing a univariate autoregression.


An Alternative Representation of a VAR(p) Process

Let $y_t$ be an (n × 1) vector satisfying
$$(I_n - \Phi_1 L - \Phi_2 L^2 - \cdots - \Phi_p L^p) y_t = \alpha + \varepsilon_t, \qquad [18.2.1]$$
where $\Phi_s$ denotes an (n × n) matrix for s = 1, 2, ..., p and $\alpha$ and $\varepsilon_t$ are (n × 1) vectors. The scalar algebra in [17.7.4] works perfectly well for matrices, establishing that for any values of $\Phi_1, \Phi_2, \ldots, \Phi_p$, the following polynomials are equivalent:
$$(I_n - \Phi_1 L - \Phi_2 L^2 - \cdots - \Phi_p L^p) = (I_n - \rho L) - (\zeta_1 L + \zeta_2 L^2 + \cdots + \zeta_{p-1} L^{p-1})(1 - L), \qquad [18.2.2]$$
where
$$\rho \equiv \Phi_1 + \Phi_2 + \cdots + \Phi_p \qquad [18.2.3]$$
$$\zeta_s \equiv -[\Phi_{s+1} + \Phi_{s+2} + \cdots + \Phi_p] \quad \text{for } s = 1, 2, \ldots, p - 1. \qquad [18.2.4]$$
It follows that any VAR(p) process [18.2.1] can always be written in the form
$$(I_n - \rho L) y_t - (\zeta_1 L + \zeta_2 L^2 + \cdots + \zeta_{p-1} L^{p-1})(1 - L) y_t = \alpha + \varepsilon_t$$
or
$$y_t = \zeta_1 \Delta y_{t-1} + \zeta_2 \Delta y_{t-2} + \cdots + \zeta_{p-1} \Delta y_{t-p+1} + \alpha + \rho y_{t-1} + \varepsilon_t. \qquad [18.2.5]$$

The null hypothesis considered throughout this section is that the first difference of y follows a VAR(p − 1) process:
$$\Delta y_t = \zeta_1 \Delta y_{t-1} + \zeta_2 \Delta y_{t-2} + \cdots + \zeta_{p-1} \Delta y_{t-p+1} + \alpha + \varepsilon_t, \qquad [18.2.6]$$
requiring from [18.2.5] that
$$\rho = I_n \qquad [18.2.7]$$
or, from [18.2.3],
$$\Phi_1 + \Phi_2 + \cdots + \Phi_p = I_n. \qquad [18.2.8]$$

Recalling Proposition 10.1, the vector autoregression [18.2.1] will be said to contain at least one unit root if the following determinant is zero:
$$|I_n - \Phi_1 - \Phi_2 - \cdots - \Phi_p| = 0. \qquad [18.2.9]$$


Note that [18.2.8] implies [18.2.9] but [18.2.9] does not imply [18.2.8]. Thus, this 
section is considering only a subset of the class of vector autoregressions containing 
a unit root, namely, the class described by [18.2.8]. Vector autoregressions for 
which [18.2.9] holds but [18.2.8] does not will be considered in Chapter 19. 
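The mapping in [18.2.3] and [18.2.4] and the polynomial equivalence [18.2.2] can be checked numerically. The sketch below is illustrative; the helper names (`dickey_fuller_form`, `lhs_poly`, `rhs_poly`) and the randomly drawn coefficient matrices are assumptions, not anything from the text. It evaluates both sides of [18.2.2] at an arbitrary complex number z in place of L, and then builds a set of coefficients satisfying [18.2.8] to confirm that the determinant in [18.2.9] vanishes.

```python
import numpy as np

def dickey_fuller_form(phis):
    """Map [Phi_1, ..., Phi_p] to (rho, [zeta_1, ..., zeta_{p-1}]) per [18.2.3]-[18.2.4]."""
    rho = sum(phis)
    zetas = [-sum(phis[s:]) for s in range(1, len(phis))]
    return rho, zetas

def lhs_poly(phis, z):
    """I_n - Phi_1 z - ... - Phi_p z^p."""
    n = phis[0].shape[0]
    return np.eye(n) - sum(phi * z ** (k + 1) for k, phi in enumerate(phis))

def rhs_poly(rho, zetas, z):
    """(I_n - rho z) - (zeta_1 z + ... + zeta_{p-1} z^{p-1})(1 - z)."""
    n = rho.shape[0]
    zeta_part = sum((zeta * z ** (k + 1) for k, zeta in enumerate(zetas)),
                    np.zeros((n, n)))
    return (np.eye(n) - rho * z) - zeta_part * (1 - z)

rng = np.random.default_rng(3)
phis = [0.3 * rng.standard_normal((2, 2)) for _ in range(3)]   # hypothetical VAR(3)
rho, zetas = dickey_fuller_form(phis)
z = 0.7 + 0.2j
poly_gap = np.abs(lhs_poly(phis, z) - rhs_poly(rho, zetas, z)).max()

# Impose [18.2.8]: choose Phi_3 so that the coefficients sum to I_n
phis_unit = phis[:-1] + [np.eye(2) - sum(phis[:-1])]
rho_unit, _ = dickey_fuller_form(phis_unit)
det_at_one = np.linalg.det(lhs_poly(phis_unit, 1.0))           # determinant in [18.2.9]
```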

This section begins with a vector generalization of case 2 from Chapter 17. 




A Vector Autoregression with No Drift in Any 
of the Variables 


Here we assume that the VAR [18.2.1] satisfies [18.2.8] along with $\alpha = 0$ and consider the consequences of estimating each equation in levels by OLS using observations t = 1, 2, ..., T and conditioning on $y_0, y_{-1}, \ldots, y_{-p+1}$. A constant term is assumed to be included in each regression. Under the maintained hypothesis [18.2.8], the data-generating process can be described as
$$(I_n - \zeta_1 L - \zeta_2 L^2 - \cdots - \zeta_{p-1} L^{p-1}) \Delta y_t = \varepsilon_t. \qquad [18.2.10]$$
Assuming that all values of z satisfying
$$|I_n - \zeta_1 z - \zeta_2 z^2 - \cdots - \zeta_{p-1} z^{p-1}| = 0$$
lie outside the unit circle, [18.2.10] implies that
$$\Delta y_t = u_t, \qquad [18.2.11]$$
where
$$u_t = (I_n - \zeta_1 L - \zeta_2 L^2 - \cdots - \zeta_{p-1} L^{p-1})^{-1} \varepsilon_t.$$

If $\varepsilon_t$ is i.i.d. with mean zero, positive definite variance-covariance matrix $\Omega = PP'$, and finite fourth moments, then $u_t$ satisfies the conditions of Proposition 18.1 with
$$\Psi(L) = (I_n - \zeta_1 L - \zeta_2 L^2 - \cdots - \zeta_{p-1} L^{p-1})^{-1}. \qquad [18.2.12]$$
Also from [18.2.11], we have
$$y_t = y_0 + u_1 + u_2 + \cdots + u_t,$$
so that $y_t$ will have the same asymptotic behavior as $\xi_t$ in Proposition 18.1.

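Recall that the fitted values of a VAR estimated in levels [18.2.1] are identical to the fitted values for a VAR estimated in the form of [18.2.5]. Consider the ith equation in [18.2.5], which we write as
$$y_{it} = \zeta_{i1}' u_{t-1} + \zeta_{i2}' u_{t-2} + \cdots + \zeta_{i,p-1}' u_{t-p+1} + \alpha_i + \rho_i' y_{t-1} + \varepsilon_{it}, \qquad [18.2.13]$$
where $u_t = \Delta y_t$ and $\zeta_{is}'$ denotes the ith row of $\zeta_s$ for s = 1, 2, ..., p − 1. Similarly, $\rho_i'$ denotes the ith row of $\rho$. Under the null hypothesis [18.2.7], $\rho_i' = e_i'$, where $e_i'$ is the ith row of the (n × n) identity matrix. Recall the usual expression [8.2.3] for the deviation of the OLS estimate $b_T$ from its hypothesized true value:
$$b_T - \beta = (\Sigma x_t x_t')^{-1} (\Sigma x_t \varepsilon_t), \qquad [18.2.14]$$
where $\Sigma$ denotes summation over t = 1 through T. In the case of OLS estimation of [18.2.13],
$$b_T - \beta = \begin{bmatrix} \hat{\zeta}_{i1} - \zeta_{i1} \\ \hat{\zeta}_{i2} - \zeta_{i2} \\ \vdots \\ \hat{\zeta}_{i,p-1} - \zeta_{i,p-1} \\ \hat{\alpha}_i \\ \hat{\rho}_i - e_i \end{bmatrix} \qquad [18.2.15]$$
$$\Sigma x_t x_t' = \begin{bmatrix}
\Sigma u_{t-1} u_{t-1}' & \Sigma u_{t-1} u_{t-2}' & \cdots & \Sigma u_{t-1} u_{t-p+1}' & \Sigma u_{t-1} & \Sigma u_{t-1} y_{t-1}' \\
\Sigma u_{t-2} u_{t-1}' & \Sigma u_{t-2} u_{t-2}' & \cdots & \Sigma u_{t-2} u_{t-p+1}' & \Sigma u_{t-2} & \Sigma u_{t-2} y_{t-1}' \\
\vdots & \vdots & & \vdots & \vdots & \vdots \\
\Sigma u_{t-p+1} u_{t-1}' & \Sigma u_{t-p+1} u_{t-2}' & \cdots & \Sigma u_{t-p+1} u_{t-p+1}' & \Sigma u_{t-p+1} & \Sigma u_{t-p+1} y_{t-1}' \\
\Sigma u_{t-1}' & \Sigma u_{t-2}' & \cdots & \Sigma u_{t-p+1}' & T & \Sigma y_{t-1}' \\
\Sigma y_{t-1} u_{t-1}' & \Sigma y_{t-1} u_{t-2}' & \cdots & \Sigma y_{t-1} u_{t-p+1}' & \Sigma y_{t-1} & \Sigma y_{t-1} y_{t-1}'
\end{bmatrix} \qquad [18.2.16]$$
$$\Sigma x_t \varepsilon_{it} = \begin{bmatrix} \Sigma u_{t-1} \varepsilon_{it} \\ \Sigma u_{t-2} \varepsilon_{it} \\ \vdots \\ \Sigma u_{t-p+1} \varepsilon_{it} \\ \Sigma \varepsilon_{it} \\ \Sigma y_{t-1} \varepsilon_{it} \end{bmatrix}. \qquad [18.2.17]$$

Our earlier convention would append a subscript T to the estimated coefficients $\hat{\zeta}_{is}$ in [18.2.15]. For this discussion, the subscript T will be suppressed to avoid excessively cumbersome notation.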

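Define $\Upsilon_T$ to be the following matrix:
$$\underset{(np+1) \times (np+1)}{\Upsilon_T} \equiv \begin{bmatrix} T^{1/2} \cdot I_{n(p-1)} & 0 & 0 \\ 0' & T^{1/2} & 0' \\ 0 & 0 & T \cdot I_n \end{bmatrix}. \qquad [18.2.18]$$
Premultiplying [18.2.14] by $\Upsilon_T$ and rearranging as in [17.4.20] results in
$$\Upsilon_T (b_T - \beta) = (\Upsilon_T^{-1}\, \Sigma x_t x_t'\, \Upsilon_T^{-1})^{-1} (\Upsilon_T^{-1}\, \Sigma x_t \varepsilon_{it}). \qquad [18.2.19]$$
Using results (a), (c), (d), (g), and (i) of Proposition 18.1, we find
$$(\Upsilon_T^{-1}\, \Sigma x_t x_t'\, \Upsilon_T^{-1}) = \begin{bmatrix}
T^{-1} \Sigma u_{t-1} u_{t-1}' & T^{-1} \Sigma u_{t-1} u_{t-2}' & \cdots & T^{-1} \Sigma u_{t-1} u_{t-p+1}' & T^{-1} \Sigma u_{t-1} & T^{-3/2} \Sigma u_{t-1} y_{t-1}' \\
T^{-1} \Sigma u_{t-2} u_{t-1}' & T^{-1} \Sigma u_{t-2} u_{t-2}' & \cdots & T^{-1} \Sigma u_{t-2} u_{t-p+1}' & T^{-1} \Sigma u_{t-2} & T^{-3/2} \Sigma u_{t-2} y_{t-1}' \\
\vdots & \vdots & & \vdots & \vdots & \vdots \\
T^{-1} \Sigma u_{t-p+1} u_{t-1}' & T^{-1} \Sigma u_{t-p+1} u_{t-2}' & \cdots & T^{-1} \Sigma u_{t-p+1} u_{t-p+1}' & T^{-1} \Sigma u_{t-p+1} & T^{-3/2} \Sigma u_{t-p+1} y_{t-1}' \\
T^{-1} \Sigma u_{t-1}' & T^{-1} \Sigma u_{t-2}' & \cdots & T^{-1} \Sigma u_{t-p+1}' & 1 & T^{-3/2} \Sigma y_{t-1}' \\
T^{-3/2} \Sigma y_{t-1} u_{t-1}' & T^{-3/2} \Sigma y_{t-1} u_{t-2}' & \cdots & T^{-3/2} \Sigma y_{t-1} u_{t-p+1}' & T^{-3/2} \Sigma y_{t-1} & T^{-2} \Sigma y_{t-1} y_{t-1}'
\end{bmatrix}$$
$$\xrightarrow{L} \begin{bmatrix} V & 0 \\ 0 & Q \end{bmatrix}, \qquad [18.2.20]$$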


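where
$$\underset{[n(p-1) \times n(p-1)]}{V} \equiv \begin{bmatrix} \Gamma_0 & \Gamma_1 & \cdots & \Gamma_{p-2} \\ \Gamma_{-1} & \Gamma_0 & \cdots & \Gamma_{p-3} \\ \vdots & \vdots & \ddots & \vdots \\ \Gamma_{-p+2} & \Gamma_{-p+3} & \cdots & \Gamma_0 \end{bmatrix} \qquad [18.2.21]$$
$$\Gamma_s \equiv E(\Delta y_t)(\Delta y_{t-s})'$$
$$\underset{(n+1) \times (n+1)}{Q} \equiv \begin{bmatrix} 1 & \left\{ \int W(r)\, dr \right\}' \Lambda' \\ \Lambda \cdot \int W(r)\, dr & \Lambda \cdot \left\{ \int [W(r)] \cdot [W(r)]'\, dr \right\} \cdot \Lambda' \end{bmatrix}. \qquad [18.2.22]$$
Also, the integral sign denotes integration over r from 0 to 1, and
$$\Lambda \equiv (I_n - \zeta_1 - \zeta_2 - \cdots - \zeta_{p-1})^{-1} P \qquad [18.2.23]$$
with $E(\varepsilon_t \varepsilon_t') = PP'$. Similarly, applying results (a), (b), and (f) from Proposition 18.1 to the second term in [18.2.19] reveals
$$(\Upsilon_T^{-1}\, \Sigma x_t \varepsilon_{it}) = \begin{bmatrix} T^{-1/2} \Sigma u_{t-1} \varepsilon_{it} \\ T^{-1/2} \Sigma u_{t-2} \varepsilon_{it} \\ \vdots \\ T^{-1/2} \Sigma u_{t-p+1} \varepsilon_{it} \\ T^{-1/2} \Sigma \varepsilon_{it} \\ T^{-1} \Sigma y_{t-1} \varepsilon_{it} \end{bmatrix} \xrightarrow{L} \begin{bmatrix} h_1 \\ h_2 \end{bmatrix}, \qquad [18.2.24]$$
where
$$\underset{[n(p-1) \times 1]}{h_1} \sim N(0, \sigma_{ii} V)$$
$$\sigma_{ii} = E(\varepsilon_{it}^2)$$
$$\underset{[(n+1) \times 1]}{h_2} = \begin{bmatrix} e_i' P\, W(1) \\ \Lambda \cdot \left\{ \int [W(r)]\,[dW(r)]' \right\} \cdot P' e_i \end{bmatrix}$$
for $e_i$ the ith column of $I_n$. Results [18.2.19], [18.2.20], and [18.2.24] establish that
$$\Upsilon_T (b_T - \beta) \xrightarrow{L} \begin{bmatrix} V^{-1} h_1 \\ Q^{-1} h_2 \end{bmatrix}. \qquad [18.2.25]$$
The first n(p − 1) elements of [18.2.25] imply that the coefficients on $\Delta y_{t-1}, \Delta y_{t-2}, \ldots, \Delta y_{t-p+1}$ converge at rate $\sqrt{T}$ to Gaussian variables:
$$\sqrt{T} \begin{bmatrix} \hat{\zeta}_{i1} - \zeta_{i1} \\ \hat{\zeta}_{i2} - \zeta_{i2} \\ \vdots \\ \hat{\zeta}_{i,p-1} - \zeta_{i,p-1} \end{bmatrix} \xrightarrow{L} V^{-1} h_1 \sim N(0, \sigma_{ii} \cdot V^{-1}). \qquad [18.2.26]$$

This means that the Wald form of the OLS $\chi^2$ test of any linear hypothesis that involves only the coefficients on $\Delta y_{t-s}$ has the usual asymptotic $\chi^2$ distribution, as the reader is invited to confirm in Exercise 18.1.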




Notice that [18.2.26] is identical to the asymptotic distribution that would characterize the estimates if the VAR were estimated in differences:
$$\Delta y_{it} = \alpha_i + \zeta_{i1}' \Delta y_{t-1} + \zeta_{i2}' \Delta y_{t-2} + \cdots + \zeta_{i,p-1}' \Delta y_{t-p+1} + \varepsilon_{it}. \qquad [18.2.27]$$

Thus, as in the case of a univariate autoregression, if the goal is to estimate the parameters $\zeta_{i1}, \zeta_{i2}, \ldots, \zeta_{i,p-1}$ or test hypotheses about these coefficients, there is no need based on the asymptotic distributions for estimating the VAR in the difference form [18.2.27] rather than in the levels form,
$$y_{it} = \zeta_{i1}' \Delta y_{t-1} + \zeta_{i2}' \Delta y_{t-2} + \cdots + \zeta_{i,p-1}' \Delta y_{t-p+1} + \alpha_i + \rho_i' y_{t-1} + \varepsilon_{it}. \qquad [18.2.28]$$

Nevertheless, the small-sample distributions may well be improved by estimating the VAR in differences, assuming that the restriction [18.2.8] is valid.

Although the asymptotic distribution of the coefficient on $y_{t-1}$ is non-Gaussian, the fact that this estimate converges at rate T means that a hypothesis test involving a single linear combination of $\rho_i$ and $\zeta_{i1}, \zeta_{i2}, \ldots, \zeta_{i,p-1}$ will be dominated asymptotically by the coefficients with the slower rate of convergence, namely, $\zeta_{i1}, \zeta_{i2}, \ldots, \zeta_{i,p-1}$, and indeed will have the same asymptotic distribution as if the true value of $\rho = I_n$ were used. For example, if the VAR is estimated in levels form [18.2.1], the individual coefficient matrices $\hat{\Phi}_s$ are related to the coefficients for the transformed VAR [18.2.5] by
$$\hat{\Phi}_p = -\hat{\zeta}_{p-1} \qquad [18.2.29]$$
$$\hat{\Phi}_s = \hat{\zeta}_s - \hat{\zeta}_{s-1} \quad \text{for } s = 2, 3, \ldots, p - 1 \qquad [18.2.30]$$
$$\hat{\Phi}_1 = \hat{\rho} + \hat{\zeta}_1. \qquad [18.2.31]$$

Since $\sqrt{T}(\hat{\zeta}_s - \zeta_s)$ is asymptotically Gaussian and since $(\hat{\rho} - \rho)$ is $O_p(T^{-1})$, it follows that $\sqrt{T}(\hat{\Phi}_s - \Phi_s)$ is asymptotically Gaussian for s = 1, 2, ..., p assuming that $p \ge 2$. This means that if the VAR is estimated in levels in the standard way, any individual autoregressive coefficient converges at rate $\sqrt{T}$ to a Gaussian variable and the usual t test of a hypothesis involving that coefficient is asymptotically valid. Moreover, an F test involving a linear combination other than $\Phi_1 + \Phi_2 + \cdots + \Phi_p$ has the usual asymptotic distribution.
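The numerical relations [18.2.29] through [18.2.31] between the levels and transformed regressions hold exactly in any sample, which the following sketch confirms on simulated data. Everything specific here is an assumption made for illustration: the bivariate data-generating process (first differences following a VAR(1) with coefficient 0.5·I), the lag length p = 3, the sample size, and the variable names.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, T = 2, 3, 300
# Hypothetical DGP: Delta y_t = 0.5 * Delta y_{t-1} + eps_t, so rho = I_n holds
dy = np.zeros((T + p, n))
eps = rng.standard_normal((T + p, n))
for t in range(1, T + p):
    dy[t] = 0.5 * dy[t - 1] + eps[t]
y = np.cumsum(dy, axis=0)

Y = y[p:]                                              # y_t for the T usable dates
lags = [y[p - s:-s] for s in range(1, p + 1)]          # y_{t-1}, ..., y_{t-p}
dlags = [lags[s - 1] - lags[s] for s in range(1, p)]   # Delta y_{t-1}, ..., Delta y_{t-p+1}

X_lev = np.hstack([np.ones((T, 1))] + lags)            # levels form [18.2.1]
X_tr = np.hstack(dlags + [np.ones((T, 1)), lags[0]])   # transformed form [18.2.5]

B_lev = np.linalg.lstsq(X_lev, Y, rcond=None)[0]       # column i holds equation i
B_tr = np.linalg.lstsq(X_tr, Y, rcond=None)[0]

phi_hat = [B_lev[1 + s * n: 1 + (s + 1) * n].T for s in range(p)]
zeta_hat = [B_tr[s * n:(s + 1) * n].T for s in range(p - 1)]
rho_hat = B_tr[(p - 1) * n + 1:].T

fit_gap = np.abs(X_lev @ B_lev - X_tr @ B_tr).max()    # fitted values coincide
```

Because the two regressor sets span the same column space, the fitted values and residuals are the same; only the interpretation of the coefficients differs.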
Another important example is testing the null hypothesis that the data follow a VAR($p_0$) with $p_0 \ge 1$ against the alternative of a VAR(p) with $p > p_0$. Consider OLS estimation of the ith equation of the VAR as represented in levels,
$$y_{it} = \alpha_i + \Phi_{i1}' y_{t-1} + \Phi_{i2}' y_{t-2} + \cdots + \Phi_{ip}' y_{t-p} + \varepsilon_{it}, \qquad [18.2.32]$$
where $\Phi_{is}'$ denotes the ith row of $\Phi_s$. Consider the null hypothesis
$$H_0: \Phi_{i,p_0+1} = \Phi_{i,p_0+2} = \cdots = \Phi_{ip} = 0. \qquad [18.2.33]$$

The Wald form of the OLS $\chi^2$ test of this hypothesis will be numerically identical to the test of
$$H_0: \zeta_{i,p_0} = \zeta_{i,p_0+1} = \cdots = \zeta_{i,p-1} = 0 \qquad [18.2.34]$$
for OLS estimation of
$$y_{it} = \zeta_{i1}' \Delta y_{t-1} + \zeta_{i2}' \Delta y_{t-2} + \cdots + \zeta_{i,p-1}' \Delta y_{t-p+1} + \alpha_i + \rho_i' y_{t-1} + \varepsilon_{it}. \qquad [18.2.35]$$

Since we have seen that the usual F test of [18.2.34] is asymptotically valid and since a test of [18.2.33] is based on the identical test statistic, it follows that the usual Wald test for assessing the number of lags to include in the regression is perfectly appropriate when the regression is estimated in levels form as in [18.2.32].




Of course, some hypothesis tests based on a VAR estimated in levels will not have the usual asymptotic distribution. An important example is a Granger-causality test of the null hypothesis that some of the variables in $y_t$ do not appear in the regression explaining $y_{1t}$. Partition $y_t = (y_{1t}', y_{2t}')'$, where $y_{2t}$ denotes the subset of variables that do not affect $y_{1t}$ under the null hypothesis. Write the regression in levels as
$$y_{1t} = \omega_1' y_{1,t-1} + \lambda_1' y_{2,t-1} + \omega_2' y_{1,t-2} + \lambda_2' y_{2,t-2} + \cdots + \omega_p' y_{1,t-p} + \lambda_p' y_{2,t-p} + \alpha_1 + \varepsilon_{1t} \qquad [18.2.36]$$
and the transformed regression as
$$y_{1t} = \beta_1' \Delta y_{1,t-1} + \gamma_1' \Delta y_{2,t-1} + \beta_2' \Delta y_{1,t-2} + \gamma_2' \Delta y_{2,t-2} + \cdots + \beta_{p-1}' \Delta y_{1,t-p+1} + \gamma_{p-1}' \Delta y_{2,t-p+1} + \alpha_1 + \eta' y_{1,t-1} + \delta' y_{2,t-1} + \varepsilon_{1t}. \qquad [18.2.37]$$
The F test of the null hypothesis $\lambda_1 = \lambda_2 = \cdots = \lambda_p = 0$ based on OLS estimation of [18.2.36] is numerically identical to the F test of the null hypothesis $\gamma_1 = \gamma_2 = \cdots = \gamma_{p-1} = \delta = 0$ based on OLS estimation of [18.2.37]. Since $\hat{\delta}$ has a nonstandard limiting distribution, a test for Granger-causality based on a VAR estimated in levels typically does not have the usual limiting $\chi^2$ distribution (see Exercise 18.2 and Toda and Phillips, 1993b, for further discussion). Monte Carlo simulations by Ohanian (1988), for example, found that if an independent random walk is added to a vector autoregression, the random walk might spuriously appear to Granger-cause the other variables in 20% of the samples if the 5% critical value for a $\chi^2$ variable is mistakenly used to interpret the test statistic. Toda and Phillips (1993a) have an analytical treatment of this issue.


A Vector Autoregression with Drift in Some of the Variables 
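Here we again consider estimation of a VAR written in the form
$$y_t = \zeta_1 \Delta y_{t-1} + \zeta_2 \Delta y_{t-2} + \cdots + \zeta_{p-1} \Delta y_{t-p+1} + \alpha + \rho y_{t-1} + \varepsilon_t. \qquad [18.2.38]$$
As before, it is assumed that roots of
$$|I_n - \zeta_1 z - \zeta_2 z^2 - \cdots - \zeta_{p-1} z^{p-1}| = 0$$
are outside the unit circle, that $\varepsilon_t$ is i.i.d. with mean zero, positive definite variance $\Omega$, and finite fourth moments, and that the true value of $\rho$ is the (n × n) identity matrix. These assumptions imply that
$$\Delta y_t = \delta + u_t, \qquad [18.2.39]$$
where
$$\delta \equiv (I_n - \zeta_1 - \zeta_2 - \cdots - \zeta_{p-1})^{-1} \alpha \qquad [18.2.40]$$
$$u_t \equiv \Psi(L) \varepsilon_t \qquad [18.2.41]$$
$$\Psi(L) \equiv (I_n - \zeta_1 L - \zeta_2 L^2 - \cdots - \zeta_{p-1} L^{p-1})^{-1}.$$

In contrast to the previous case, in which it was assumed that $\delta = 0$, here we suppose that at least one and possibly all of the elements of $\delta$ are nonzero.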

Since this is a vector generalization of case 3 for the univariate autoregression considered in Chapter 17, one's first thought might be that, because of the nonzero drift in the I(1) regressors, if all of the elements of $\delta$ are nonzero, then all the coefficients will have the usual Gaussian limiting distribution. However, this turns out not to be the case. Any individual element $y_{jt}$ of the vector $y_t$ is dominated by a deterministic time trend, and if $y_{jt}$ appeared alone in the regression, the asymptotic results would be the same as if $y_{jt}$ were replaced by the time trend t. Indeed, as noted by West (1988), in a regression in which there is a single I(1) regressor with nonzero drift and in which all other regressors are I(0), all of the coefficients would be asymptotically Gaussian and F tests would have their usual limiting distribution. This can be shown using essentially the same algebra as in the univariate autoregression analyzed in case 3 in Chapter 17. However, as noted by Sims, Stock, and Watson (1990), in [18.2.38] there are n different I(1) regressors (the n elements of $y_{t-1}$), and if each of these were replaced by $\delta_j (t - 1)$, the resulting regressors would be perfectly collinear. OLS will fit n separate linear combinations of $y_t$ so as to try to minimize the sum of squared residuals, and while one of these will indeed pick up the deterministic time trend t, the other linear combinations correspond to I(1) driftless variables.

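To develop the correct asymptotic distribution, it is convenient to work with a transformation of [18.2.38] that isolates these different linear combinations. Note that the difference equation [18.2.39] implies that
$$y_t = y_0 + \delta \cdot t + u_1 + u_2 + \cdots + u_t. \qquad [18.2.42]$$

Suppose for illustration that the nth variable in the system exhibits nonzero drift ($\delta_n \ne 0$); whether in addition $\delta_i \ne 0$ for i = 1, 2, ..., n − 1 then turns out to be irrelevant, assuming that [18.2.8] holds. Define
$$y_{1t}^* \equiv y_{1t} - (\delta_1/\delta_n) y_{nt}$$
$$y_{2t}^* \equiv y_{2t} - (\delta_2/\delta_n) y_{nt}$$
$$\vdots$$
$$y_{n-1,t}^* \equiv y_{n-1,t} - (\delta_{n-1}/\delta_n) y_{nt}$$
$$y_{nt}^* \equiv y_{nt}.$$
Thus, for i = 1, 2, ..., n − 1,
$$y_{it}^* = [y_{i0} + \delta_i t + u_{i1} + u_{i2} + \cdots + u_{it}] - (\delta_i/\delta_n)[y_{n0} + \delta_n t + u_{n1} + u_{n2} + \cdots + u_{nt}] \equiv y_{i0}^* + \xi_{it}^*,$$
where we have defined
$$y_{i0}^* \equiv [y_{i0} - (\delta_i/\delta_n) y_{n0}]$$
$$\xi_{it}^* \equiv u_{i1}^* + u_{i2}^* + \cdots + u_{it}^*$$
$$u_{it}^* \equiv u_{it} - (\delta_i/\delta_n) u_{nt}.$$
Collecting $u_{1t}^*, u_{2t}^*, \ldots, u_{n-1,t}^*$ in an [(n − 1) × 1] vector $u_t^*$, it follows from [18.2.41] that
$$u_t^* = \Psi^*(L) \varepsilon_t,$$
where $\Psi^*(L)$ denotes the following [(n − 1) × n] matrix polynomial:
$$\Psi^*(L) \equiv H \cdot \Psi(L)$$
for
$$\underset{[(n-1) \times n]}{H} \equiv \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 & -(\delta_1/\delta_n) \\ 0 & 1 & 0 & \cdots & 0 & -(\delta_2/\delta_n) \\ \vdots & \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & 1 & -(\delta_{n-1}/\delta_n) \end{bmatrix}.$$

Since $\{s \cdot \Psi_s\}_{s=0}^{\infty}$ is absolutely summable, so is $\{s \cdot \Psi_s^*\}_{s=0}^{\infty}$. Hence, the [(n − 1) × 1] vector $y_t^* = (y_{1t}^*, y_{2t}^*, \ldots, y_{n-1,t}^*)'$ has the same asymptotic properties as the vector $\xi_t$ in Proposition 18.1 with the matrix $\Psi(1)$ in Proposition 18.1 replaced by $\Psi^*(1)$.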
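The effect of the matrix H can be seen directly in a small simulation. In the sketch below, the drift vector `delta`, the i.i.d. choice for $u_t$, the sample size, and the function name `drift_removal_matrix` are illustrative assumptions. The code builds H, confirms that $H\delta = 0$ (so the time trend is annihilated), and checks that the transformed series equals the partial sums of $u_t^* = H u_t$ when $y_0 = 0$.

```python
import numpy as np

def drift_removal_matrix(delta):
    """Build the (n-1) x n matrix H = [I_{n-1}, -(delta_i / delta_n)]."""
    delta = np.asarray(delta, dtype=float)
    if delta[-1] == 0:
        raise ValueError("the last element of delta must be nonzero")
    n = delta.size
    return np.hstack([np.eye(n - 1), -(delta[:-1] / delta[-1])[:, None]])

delta = np.array([0.5, 0.0, -1.2, 0.8])   # hypothetical drifts; delta_n is nonzero
H = drift_removal_matrix(delta)

rng = np.random.default_rng(5)
T = 400
u = rng.standard_normal((T, delta.size))  # stand-in for the stationary u_t
y = np.cumsum(delta + u, axis=0)          # y_t = delta*t + u_1 + ... + u_t, y_0 = 0
y_star = y @ H.T                          # first n-1 transformed variables
xi_star = np.cumsum(u @ H.T, axis=0)      # partial sums of u*_t = H u_t
```

Note that the second element of `delta` is zero in this example, which illustrates the remark that only $\delta_n \ne 0$ is needed for the transformation.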


If we had direct observations on $y_t^*$ and $u_t$, the fitted values of the VAR as estimated from [18.2.38] would clearly be identical to those from estimation of
$$y_t = \zeta_1 u_{t-1} + \zeta_2 u_{t-2} + \cdots + \zeta_{p-1} u_{t-p+1} + \alpha^* + \rho^* y_{t-1}^* + \gamma y_{n,t-1} + \varepsilon_t, \qquad [18.2.43]$$
where $\rho^*$ denotes an [n × (n − 1)] matrix of coefficients while $\gamma$ is an (n × 1) vector of coefficients. This representation separates the zero-mean stationary regressors ($u_{t-s} = \Delta y_{t-s} - \delta$), the constant term ($\alpha^*$), the driftless I(1) regressors ($y_{t-1}^*$), and a term dominated asymptotically by a time trend ($y_{n,t-1}$). As in Section 16.3, once the hypothetical VAR [18.2.43] is analyzed, we can infer the properties of the VAR as actually estimated ([18.2.38] or [18.2.1]) from the relation between the fitted values for the different representations.
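Consider the ith equation in [18.2.43],
$$y_{it} = \zeta_{i1}' u_{t-1} + \zeta_{i2}' u_{t-2} + \cdots + \zeta_{i,p-1}' u_{t-p+1} + \alpha_i^* + \rho_i^{*\prime} y_{t-1}^* + \gamma_i y_{n,t-1} + \varepsilon_{it}, \qquad [18.2.44]$$
where $\zeta_{is}'$ denotes the ith row of $\zeta_s$ and $\rho_i^{*\prime}$ is the ith row of $\rho^*$. Define
$$\underset{[(np+1) \times 1]}{x_t^*} \equiv (u_{t-1}', u_{t-2}', \ldots, u_{t-p+1}', 1, y_{t-1}^{*\prime}, y_{n,t-1})'$$
$$\underset{[(np+1) \times (np+1)]}{\Upsilon_T} \equiv \begin{bmatrix} T^{1/2} \cdot I_{n(p-1)} & 0 & 0 & 0 \\ 0' & T^{1/2} & 0' & 0 \\ 0 & 0 & T \cdot I_{n-1} & 0 \\ 0' & 0 & 0' & T^{3/2} \end{bmatrix} \qquad [18.2.45]$$
$$\underset{[(n-1) \times n]}{\Lambda^*} \equiv \Psi^*(1) \cdot P,$$
where $E(\varepsilon_t \varepsilon_t') = PP'$. Then, from Proposition 18.1,
$$\Upsilon_T^{-1} \left( \sum_{t=1}^{T} x_t^* x_t^{*\prime} \right) \Upsilon_T^{-1} \xrightarrow{L} \begin{bmatrix}
V & 0 & 0 & 0 \\
0' & 1 & \left\{ \int W(r)\, dr \right\}' \Lambda^{*\prime} & \delta_n/2 \\
0 & \Lambda^* \int W(r)\, dr & \Lambda^* \left\{ \int [W(r)][W(r)]'\, dr \right\} \Lambda^{*\prime} & \delta_n \Lambda^* \int r W(r)\, dr \\
0' & \delta_n/2 & \delta_n \left\{ \int r W(r)\, dr \right\}' \Lambda^{*\prime} & \delta_n^2/3
\end{bmatrix}, \qquad [18.2.46]$$
where
$$\underset{[n(p-1) \times n(p-1)]}{V} \equiv \begin{bmatrix} \Gamma_0 & \Gamma_1 & \cdots & \Gamma_{p-2} \\ \Gamma_{-1} & \Gamma_0 & \cdots & \Gamma_{p-3} \\ \vdots & \vdots & \ddots & \vdots \\ \Gamma_{-p+2} & \Gamma_{-p+3} & \cdots & \Gamma_0 \end{bmatrix} \qquad [18.2.47]$$
and W(r) denotes n-dimensional standard Brownian motion while the integral sign indicates integration over r from 0 to 1. Similarly,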


$$\Upsilon_T^{-1} \sum_{t=1}^{T} x_t^* \varepsilon_{it} \xrightarrow{L} \begin{bmatrix} h_1 \\ h_2 \\ h_3 \\ h_4 \end{bmatrix}, \qquad [18.2.48]$$
where $h_1 \sim N(0, \sigma_{ii} V)$. The variables $h_2$ and $h_4$ are also Gaussian, though $h_3$ is non-Gaussian. If we define $\omega$ to be the vector of coefficients on lagged $\Delta y$,
$$\omega \equiv (\zeta_{i1}', \zeta_{i2}', \ldots, \zeta_{i,p-1}')',$$
then the preceding results imply that
$$\Upsilon_T (b_T^* - \beta^*) = \begin{bmatrix} T^{1/2}(\hat{\omega}_T - \omega) \\ T^{1/2}(\hat{\alpha}_{i,T}^* - \alpha_i^*) \\ T(\hat{\rho}_{i,T}^* - \rho_i^*) \\ T^{3/2}(\hat{\gamma}_{i,T} - \gamma_i) \end{bmatrix} \xrightarrow{L} \begin{bmatrix} V^{-1} h_1 \\ Q^{-1} \eta \end{bmatrix}, \qquad [18.2.49]$$
where $\eta \equiv (h_2, h_3', h_4)'$ and Q is the [(n + 1) × (n + 1)] lower right block of the matrix in [18.2.46]. Thus, as usual, the coefficients on $u_{t-s}$ in [18.2.43] are asymptotically Gaussian:
$$\sqrt{T}(\hat{\omega}_T - \omega) \xrightarrow{L} N(0, \sigma_{ii} V^{-1}).$$

These coefficients are, of course, numerically identical to the coefficients on $\Delta y_{t-s}$ in [18.2.38]. Any F tests involving just these coefficients are also identical for the two parameterizations. Hence, an F test about $\zeta_1, \zeta_2, \ldots, \zeta_{p-1}$ in [18.2.38] has the usual limiting $\chi^2$ distribution. This is the same asymptotic distribution as if [18.2.38] were estimated with $\rho = I_n$ imposed; that is, it is the same asymptotic distribution whether the regression is estimated in levels or in differences.

Since $\hat{\rho}_T^*$ and $\hat{\gamma}_T$ converge at a faster rate than $\hat{\omega}_T$, a linear combination of $\hat{\omega}_T$, $\hat{\rho}_T^*$, and $\hat{\gamma}_T$ that puts nonzero weight on $\hat{\omega}_T$ has the same asymptotic distribution as a linear combination that uses the true values for $\rho$ and $\gamma$. This means, for example, that the original coefficients $\hat{\Phi}_s$ of the VAR estimated in levels as in [18.2.1] are all individually Gaussian and can be interpreted using the usual t tests. A Wald test of the null hypothesis of $p_0 \ge 1$ lags against the alternative of $p > p_0$ lags again has the usual $\chi^2$ distribution. However, Granger-causality tests typically have nonstandard distributions.


18.3. Spurious Regressions 
Consider a regression of the form
$$y_t = x_t' \beta + u_t,$$
for which elements of $y_t$ and $x_t$ might be nonstationary. If there does not exist some population value for $\beta$ for which the residual $u_t = y_t - x_t' \beta$ is I(0), then OLS is quite likely to produce spurious results. This phenomenon was first discovered in Monte Carlo experimentation by Granger and Newbold (1974) and later explained theoretically by Phillips (1986).

A general statement of the spurious regression problem can be made as 
follows. Let y, be an (n x 1) vector of /(1) variables. Define g = (n — 1), and 


18.3. Spurious Regressions 5§7 


partition y, as 


= e “| 
y: = 3 
Ya 


where y>, denotes a (g x 1) vector. Consider the consequences of an OLS regression 
of the first variable on the others and a constant, 


Yu = @ + Y'Ya + u,. [18.3.1] 
The OLS coefficient estimates for a sample of size T are given by 
& Psy | 7s 
ba = a | sa |: [18.3.2] 
Yr Zoe Ly2¥xJ LZYaxu 


where $\Sigma$ indicates summation over $t$ from 1 to $T$. It turns out that even if $y_{1t}$ is
completely unrelated to $y_{2t}$, the estimated value of $\gamma$ is likely to appear to be
statistically significantly different from zero. Indeed, consider any null hypothesis
of the form $H_0$: $R\gamma = r$, where $R$ is a known $(m \times g)$ matrix representing $m$
separate hypotheses involving $\gamma$ and $r$ is a known $(m \times 1)$ vector. The OLS $F$ test
of this null hypothesis is


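$$F_T = \{R\hat{\gamma}_T - r\}' \left\{ s_T^2 \begin{bmatrix} 0 & R \end{bmatrix} \begin{bmatrix} T & \Sigma y_{2t}' \\ \Sigma y_{2t} & \Sigma y_{2t} y_{2t}' \end{bmatrix}^{-1} \begin{bmatrix} 0' \\ R' \end{bmatrix} \right\}^{-1} \{R\hat{\gamma}_T - r\} \div m, \qquad [18.3.3]$$

where

$$s_T^2 \equiv (T - n)^{-1} \sum_{t=1}^{T} \hat{u}_t^2. \qquad [18.3.4]$$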


Unless there is some value for $\gamma$ such that $y_{1t} - \gamma' y_{2t}$ is stationary, the OLS estimate
$\hat{\gamma}_T$ will appear to be spuriously precise in the sense that the $F$ test is virtually certain
to reject any null hypothesis if the sample size is sufficiently large, even though
$\hat{\gamma}_T$ does not provide a consistent estimate of any well-defined population constant!

The following proposition, adapted from Phillips (1986), provides the formal 
basis for these statements. 


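Proposition 18.2: Consider an $(n \times 1)$ vector $y_t$ whose first difference is described
by

$$\Delta y_t = \Psi(L)\varepsilon_t = \sum_{s=0}^{\infty} \Psi_s \varepsilon_{t-s}$$

for $\varepsilon_t$ an i.i.d. $(n \times 1)$ vector with mean zero, variance $E(\varepsilon_t \varepsilon_t') = PP'$, and finite
fourth moments and where $\{s \cdot \Psi_s\}_{s=0}^{\infty}$ is absolutely summable. Let $g \equiv (n - 1)$ and
$\Lambda \equiv \Psi(1) \cdot P$. Partition $y_t$ as $y_t = (y_{1t}, y_{2t}')'$, and partition $\Lambda\Lambda'$ as

$$\Lambda\Lambda' = \begin{bmatrix} \Sigma_{11} & \Sigma_{21}' \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}, \qquad [18.3.5]$$

where $\Sigma_{11}$ is $(1 \times 1)$, $\Sigma_{21}'$ is $(1 \times g)$, $\Sigma_{21}$ is $(g \times 1)$, and $\Sigma_{22}$ is $(g \times g)$.
Suppose that $\Lambda\Lambda'$ is nonsingular, and define

$$(\sigma_1^*)^2 \equiv (\Sigma_{11} - \Sigma_{21}' \Sigma_{22}^{-1} \Sigma_{21}). \qquad [18.3.6]$$

Let $L_{22}$ denote the Cholesky factor of $\Sigma_{22}^{-1}$; that is, $L_{22}$ is the lower triangular matrix
satisfying

$$\Sigma_{22}^{-1} = L_{22} L_{22}'. \qquad [18.3.7]$$

Then the following hold.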


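(a) The OLS estimates $\hat{\alpha}_T$ and $\hat{\gamma}_T$ in [18.3.2] are characterized by

$$\begin{bmatrix} T^{-1/2}\hat{\alpha}_T \\ \hat{\gamma}_T - \Sigma_{22}^{-1}\Sigma_{21} \end{bmatrix} \xrightarrow{L} \begin{bmatrix} \sigma_1^* h_1 \\ \sigma_1^* L_{22} h_2 \end{bmatrix}, \qquad [18.3.8]$$

where

$$\begin{bmatrix} h_1 \\ h_2 \end{bmatrix} \equiv \begin{bmatrix} 1 & \int [W_2^*(r)]'\,dr \\ \int W_2^*(r)\,dr & \int [W_2^*(r)]\cdot[W_2^*(r)]'\,dr \end{bmatrix}^{-1} \begin{bmatrix} \int W_1^*(r)\,dr \\ \int W_2^*(r)\cdot W_1^*(r)\,dr \end{bmatrix}, \qquad [18.3.9]$$

$h_1$ is a scalar, $h_2$ is a $(g \times 1)$ vector, the integral sign indicates integration
over $r$ from 0 to 1, $W_1^*(r)$ denotes scalar standard Brownian motion, and $W_2^*(r)$
denotes $g$-dimensional standard Brownian motion with $W_2^*(r)$ independent of $W_1^*(r)$.

(b) The sum of squared residuals $RSS_T$ from OLS estimation of [18.3.1] satisfies

$$T^{-2} \cdot RSS_T \xrightarrow{L} (\sigma_1^*)^2 \cdot H, \qquad [18.3.10]$$

where

$$H \equiv \int [W_1^*(r)]^2\,dr - \left\{ \begin{bmatrix} \int W_1^*(r)\,dr & \int [W_1^*(r)]\cdot[W_2^*(r)]'\,dr \end{bmatrix} \begin{bmatrix} 1 & \int [W_2^*(r)]'\,dr \\ \int W_2^*(r)\,dr & \int [W_2^*(r)]\cdot[W_2^*(r)]'\,dr \end{bmatrix}^{-1} \begin{bmatrix} \int W_1^*(r)\,dr \\ \int W_2^*(r)\cdot W_1^*(r)\,dr \end{bmatrix} \right\}. \qquad [18.3.11]$$

(c) The OLS $F$ test [18.3.3] satisfies

$$T^{-1} \cdot F_T \xrightarrow{L} \{\sigma_1^* \cdot R^* h_2 - r^*\}' \left\{ (\sigma_1^*)^2 \cdot H \begin{bmatrix} 0 & R^* \end{bmatrix} \begin{bmatrix} 1 & \int [W_2^*(r)]'\,dr \\ \int W_2^*(r)\,dr & \int [W_2^*(r)]\cdot[W_2^*(r)]'\,dr \end{bmatrix}^{-1} \begin{bmatrix} 0' \\ R^{*\prime} \end{bmatrix} \right\}^{-1} \{\sigma_1^* \cdot R^* h_2 - r^*\} \div m, \qquad [18.3.12]$$

where

$$R^* \equiv R \cdot L_{22}$$
$$r^* \equiv r - R\Sigma_{22}^{-1}\Sigma_{21}.$$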




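The simplest illustration of Proposition 18.2 is provided when $y_{1t}$ and $y_{2t}$ are
scalars following totally unrelated random walks:

$$y_{1t} = y_{1,t-1} + \varepsilon_{1t} \qquad [18.3.13]$$

$$y_{2t} = y_{2,t-1} + \varepsilon_{2t}, \qquad [18.3.14]$$

where $\varepsilon_{1t}$ is i.i.d. with mean zero and variance $\sigma_1^2$, $\varepsilon_{2t}$ is i.i.d. with mean zero and
variance $\sigma_2^2$, and $\varepsilon_{1t}$ is independent of $\varepsilon_{2\tau}$ for all $t$ and $\tau$. For $y_t = (y_{1t}, y_{2t})'$, this
specification implies

$$P = \begin{bmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{bmatrix}, \qquad \Psi(1) = I_2,$$

$$\begin{bmatrix} \Sigma_{11} & \Sigma_{21}' \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} = \Psi(1)\cdot P\cdot P'\cdot[\Psi(1)]' = \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix},$$

$$\sigma_1^* = \sigma_1, \qquad L_{22} = 1/\sigma_2.$$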


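Result (a) then claims that an OLS regression of $y_{1t}$ on $y_{2t}$ and a constant,

$$y_{1t} = \alpha + \gamma y_{2t} + u_t, \qquad [18.3.15]$$

produces estimates $\hat{\alpha}_T$ and $\hat{\gamma}_T$ characterized by

$$\begin{bmatrix} T^{-1/2}\hat{\alpha}_T \\ \hat{\gamma}_T \end{bmatrix} \xrightarrow{L} \begin{bmatrix} \sigma_1 \cdot h_1 \\ (\sigma_1/\sigma_2)\cdot h_2 \end{bmatrix}.$$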


Note the contrast between this result and any previous asymptotic distribution
analyzed. Usually, the OLS estimates are consistent with $b_T \xrightarrow{p} 0$ and must be
multiplied by some increasing function of $T$ in order to obtain a nondegenerate
asymptotic distribution. Here, however, neither estimate is consistent: different
arbitrarily large samples will have randomly differing estimates $\hat{\gamma}_T$. Indeed, the
estimate of the constant term $\hat{\alpha}_T$ actually diverges, and must be divided by $T^{1/2}$ to
obtain a random variable with a well-specified distribution; the estimate $\hat{\alpha}_T$ itself
is likely to get farther and farther from the true value of zero as the sample size
$T$ increases.

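Result (b) implies that the usual OLS estimate of the variance of $u_t$,

$$s_T^2 = (T - n)^{-1}\cdot RSS_T,$$

again diverges as $T \to \infty$. To obtain an estimate that does not grow with the sample
size, the residual sum of squares has to be divided by $T^2$ rather than $T$. In this
respect, the residuals $\hat{u}_t$ from a spurious regression behave like a unit root process;
if $\xi_t$ is a scalar I(1) series, then $T^{-1}\Sigma\xi_t^2$ diverges and $T^{-2}\Sigma\xi_t^2$ converges. To see
why $\hat{u}_t$ behaves like an I(1) series, notice that the OLS residual is given by

$$\hat{u}_t = y_{1t} - \hat{\alpha}_T - \hat{\gamma}_T' y_{2t},$$

from which

$$\Delta\hat{u}_t = \Delta y_{1t} - \hat{\gamma}_T' \Delta y_{2t} = \begin{bmatrix} 1 & -\hat{\gamma}_T' \end{bmatrix} \begin{bmatrix} \Delta y_{1t} \\ \Delta y_{2t} \end{bmatrix} \xrightarrow{L} \begin{bmatrix} 1 & -h_2^{*\prime} \end{bmatrix} \Delta y_t, \qquad [18.3.16]$$

where $h_2^* \equiv \Sigma_{22}^{-1}\Sigma_{21} + \sigma_1^* L_{22} h_2$. This is a random vector $[1 \;\; -h_2^{*\prime}]$ times the I(0)
vector $\Delta y_t$.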




Result (c) means that any OLS $t$ or $F$ test based on the spurious regression
[18.3.1] also diverges; the OLS $F$ statistic [18.3.3] must be divided by $T$ to obtain
a variable that does not grow with the sample size. Since an $F$ test of a single
restriction is the square of the corresponding $t$ test, any $t$ statistic would have to
be divided by $T^{1/2}$ to obtain a convergent variable. Thus, as the sample size $T$
becomes larger, it becomes increasingly likely that the absolute value of an OLS
$t$ test will exceed any arbitrary finite value (such as the usual critical value of
$t = 2$). For example, in the regression of [18.3.15], it will appear that $y_{1t}$ and $y_{2t}$
are significantly related whereas in reality they are completely independent.
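
The divergence of the $t$ statistic can be illustrated with a small Monte Carlo experiment in the spirit of Granger and Newbold (1974). The following Python sketch is not part of the original text; the function name `spurious_t_stats`, the sample sizes, the number of replications, and the seed are illustrative assumptions. It simulates the independent random walks [18.3.13] and [18.3.14] with unit variances, estimates [18.3.15] by OLS, and records how often $|t| > 2$ for the coefficient on $y_{2t}$.

```python
import numpy as np

def spurious_t_stats(T, n_rep=500, seed=0):
    """OLS t statistics on gamma in y1 = alpha + gamma*y2 + u for independent random walks."""
    rng = np.random.default_rng(seed)
    t_stats = np.empty(n_rep)
    for i in range(n_rep):
        y1 = np.cumsum(rng.standard_normal(T))   # random walk [18.3.13]
        y2 = np.cumsum(rng.standard_normal(T))   # independent random walk [18.3.14]
        X = np.column_stack([np.ones(T), y2])    # constant and y2 as regressors
        b, *_ = np.linalg.lstsq(X, y1, rcond=None)
        resid = y1 - X @ b
        s2 = resid @ resid / (T - 2)             # s_T^2 as in [18.3.4] with n = 2
        cov = s2 * np.linalg.inv(X.T @ X)
        t_stats[i] = b[1] / np.sqrt(cov[1, 1])
    return t_stats

# Share of replications in which |t| exceeds 2, for increasing sample sizes.
rates = {T: float(np.mean(np.abs(spurious_t_stats(T)) > 2.0)) for T in (50, 200, 800)}
print(rates)
```

Under the assumptions of this sketch the rejection frequency should rise with $T$, in line with result (c), even though the two series are independent by construction.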

In more general regressions of the form of [18.3.1], $\Delta y_{1t}$ and $\Delta y_{2t}$ may be
dynamically related through nonzero off-diagonal elements of $P$ and $\Psi(L)$. While
such correlations will influence the values of the nuisance parameters $\sigma_1^*$, $\Sigma_{21}$, and
$\Sigma_{22}$, provided that the conditions of Proposition 18.2 are satisfied, these correlations
do not affect the overall nature of the results or rates of convergence for any of
the statistics. Note that since $W_1^*(r)$ and $W_2^*(r)$ are standard Brownian motion, the
distributions of $h_1$, $h_2$, and $H$ in Proposition 18.2 depend only on the number of
variables in the regression and not on their dynamic relations.

The condition in Proposition 18.2 that $\Lambda\cdot\Lambda'$ is nonsingular might appear
innocuous but is actually quite important. In the case of a single variable ($y_t = y_{1t}$
with $\Delta y_{1t} = \psi(L)\varepsilon_{1t}$), the matrix $\Lambda\cdot\Lambda'$ would just be the scalar $[\psi(1)\cdot\sigma_1]^2$ and the
condition that $\Lambda\cdot\Lambda'$ is nonsingular would come down to the requirement that $\psi(1)$
be nonzero. To understand what this means, suppose that $y_{1t}$ were actually
stationary with Wold representation:

$$y_{1t} = \varepsilon_{1t} + C_1\varepsilon_{1,t-1} + C_2\varepsilon_{1,t-2} + \cdots = C(L)\varepsilon_{1t}.$$

Then the first difference $\Delta y_{1t}$ would be described by

$$\Delta y_{1t} = (1 - L)C(L)\varepsilon_{1t} \equiv \psi(L)\varepsilon_{1t},$$

where $\psi(L) \equiv (1 - L)C(L)$, meaning $\psi(1) = (1 - 1)\cdot C(1) = 0$. Thus, if $y_{1t}$ were
actually I(0) rather than I(1), the condition that $\Lambda\cdot\Lambda'$ is nonsingular would not be
satisfied.

For the more general case in which $y_t$ is an $(n \times 1)$ vector, the condition that
$\Lambda\cdot\Lambda'$ is nonsingular will not be satisfied if some explanatory variable $y_{it}$ is I(0) or
if some linear combination of the elements of $y_t$ is I(0). If $y_t$ is an I(1) vector but
some linear combination of $y_t$ is I(0), then the elements of $y_t$ are said to be
cointegrated. Thus, Proposition 18.2 describes the consequences of OLS estimation of
[18.3.1] only when all of the elements of $y_t$ are I(1) with zero drift and when the
vector $y_t$ is not cointegrated. A regression is spurious only when the residual $u_t$ is
nonstationary for all possible values of the coefficient vector.


Cures for Spurious Regressions 


There are three ways in which the problems associated with spurious regres- 
sions can be avoided. The first approach is to include lagged values of both the 
dependent and independent variable in the regression. For example, consider the 
following model as an alternative to [18.3.15]: 


$$y_{1t} = \alpha + \phi y_{1,t-1} + \gamma y_{2t} + \delta y_{2,t-1} + u_t. \qquad [18.3.17]$$


This regression does not satisfy the conditions of Proposition 18.2, because there
exist values for the coefficients, specifically $\phi = 1$ and $\gamma = \delta = 0$, for which
the error term $u_t$ is I(0). It can be shown that OLS estimation of [18.3.17]
yields consistent estimates of all of the parameters. The coefficients $\hat{\gamma}_T$ and $\hat{\delta}_T$ each
individually converge at rate $\sqrt{T}$ to a Gaussian distribution, and the $t$ test of the
hypothesis that $\gamma = 0$ is asymptotically $N(0, 1)$, as is the $t$ test of the hypothesis
that $\delta = 0$. However, an $F$ test of the joint null hypothesis that $\gamma$ and $\delta$ are both
zero has a nonstandard limiting distribution; see Exercise 18.3. Hence, including
lagged values in the regression is sufficient to solve many of the problems associated
with spurious regressions, although tests of some hypotheses will still involve
nonstandard distributions.

A second approach is to difference the data before estimating the relation,
as in

$$\Delta y_{1t} = \alpha + \gamma\Delta y_{2t} + u_t. \qquad [18.3.18]$$

Clearly, since the regressors and error term $u_t$ are all I(0) for this regression under
the null hypothesis, $\hat{\alpha}_T$ and $\hat{\gamma}_T$ both converge at rate $\sqrt{T}$ to Gaussian variables.
Any $t$ or $F$ test based on [18.3.18] has the usual limiting Gaussian or $\chi^2$ distribution.
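
The contrast with the levels regression can be seen by simulation. The sketch below is an illustrative addition rather than part of the original text; the function name `differenced_t_stats`, the sample size, the replication count, and the seed are assumptions. It differences two independent random walks, estimates [18.3.18] by OLS, and reports how often $|t| > 2$ for the coefficient on $\Delta y_{2t}$.

```python
import numpy as np

def differenced_t_stats(T, n_rep=500, seed=0):
    """OLS t statistics on gamma in the differenced regression [18.3.18]."""
    rng = np.random.default_rng(seed)
    t_stats = np.empty(n_rep)
    for i in range(n_rep):
        y1 = np.cumsum(rng.standard_normal(T + 1))  # independent random walks in levels
        y2 = np.cumsum(rng.standard_normal(T + 1))
        dy1, dy2 = np.diff(y1), np.diff(y2)         # first differences, length T
        X = np.column_stack([np.ones(T), dy2])
        b, *_ = np.linalg.lstsq(X, dy1, rcond=None)
        resid = dy1 - X @ b
        s2 = resid @ resid / (T - 2)
        cov = s2 * np.linalg.inv(X.T @ X)
        t_stats[i] = b[1] / np.sqrt(cov[1, 1])
    return t_stats

# With stationary regressors the rejection frequency should be near the nominal 5%.
rate_diff = float(np.mean(np.abs(differenced_t_stats(200)) > 2.0))
print(rate_diff)
```

In this setup the rejection frequency should stay close to the nominal size of roughly 5% rather than growing with $T$.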

A third approach, analyzed by Blough (1992), is to estimate [18.3.15] with
Cochrane-Orcutt adjustment for first-order serial correlation of the residuals. We
will see in Proposition 19.4 in the following chapter that if $\hat{u}_t$ denotes the sample
residual from OLS estimation of [18.3.15], then the estimated autoregressive
coefficient $\hat{\rho}_T$ from an OLS regression of $\hat{u}_t$ on $\hat{u}_{t-1}$ converges in probability to unity.
Blough showed that the Cochrane-Orcutt GLS regression is then asymptotically
equivalent to the differenced regression [18.3.18].
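
The behavior of the residual autoregressive coefficient is easy to check numerically. The following sketch is an illustration under assumed settings (one simulated sample of length 2000, unit innovation variances, an arbitrary seed) and is not Blough's procedure itself; it only computes the first-step coefficient from regressing $\hat{u}_t$ on $\hat{u}_{t-1}$.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 2000                                        # illustrative sample size
y1 = np.cumsum(rng.standard_normal(T))          # independent random walks
y2 = np.cumsum(rng.standard_normal(T))

# Levels regression [18.3.15]: y1 on a constant and y2.
X = np.column_stack([np.ones(T), y2])
b, *_ = np.linalg.lstsq(X, y1, rcond=None)
u_hat = y1 - X @ b                              # residuals from the spurious regression

# OLS regression of u_hat_t on u_hat_{t-1} (no constant).
rho_hat = (u_hat[1:] @ u_hat[:-1]) / (u_hat[:-1] @ u_hat[:-1])
print(rho_hat)
```

A value of `rho_hat` very close to unity is what one would expect here, which is why quasi-differencing with $\hat{\rho}_T$ behaves much like first-differencing.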

Because the specification [18.3.18] avoids the spurious regression problem as
well as the nonstandard distributions for certain hypotheses associated with the
levels regression [18.3.15], many researchers recommend routinely differencing
apparently nonstationary variables before estimating regressions. While this is the
ideal cure for the problem discussed in this section, there are two different situations
in which it might be inappropriate. First, if the data are really stationary (for
example, if the true value of $\phi$ in [18.3.17] is 0.9 rather than unity), then
differencing the data can result in a misspecified regression. Second, even if both
$y_{1t}$ and $y_{2t}$ are truly I(1) processes, there is an interesting class of models for which
the bivariate dynamic relation between $y_1$ and $y_2$ will be misspecified if the
researcher simply differences both $y_1$ and $y_2$. This class of models, known as
cointegrated processes, is discussed in the following chapter.


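APPENDIX 18.A. Proofs of Chapter 18 Propositions

■ Proof of Proposition 18.1.

(a) This follows from [18.1.7] and [18.1.8] with $r = 1$.

(b) The derivation is identical to that in [11.A.3].

(c) This follows from Proposition 10.2(d).

(d) Note first in a generalization of [17.1.10] and [17.1.11] that

$$\sum_{t=1}^{T} \xi_t\xi_t' = \sum_{t=1}^{T} (\xi_{t-1} + u_t)(\xi_{t-1} + u_t)' = \sum_{t=1}^{T} (\xi_{t-1}\xi_{t-1}' + \xi_{t-1}u_t' + u_t\xi_{t-1}' + u_tu_t'),$$

so that

$$\begin{aligned} \sum_{t=1}^{T} (\xi_{t-1}u_t' + u_t\xi_{t-1}') &= \sum_{t=1}^{T} \xi_t\xi_t' - \sum_{t=1}^{T} \xi_{t-1}\xi_{t-1}' - \sum_{t=1}^{T} u_tu_t' \\ &= \xi_T\xi_T' - \xi_0\xi_0' - \sum_{t=1}^{T} u_tu_t' \\ &= \xi_T\xi_T' - \sum_{t=1}^{T} u_tu_t'. \end{aligned} \qquad [18.A.1]$$

Dividing by $T$,

$$T^{-1}\sum_{t=1}^{T} (\xi_{t-1}u_t' + u_t\xi_{t-1}') = T^{-1}\xi_T\xi_T' - T^{-1}\sum_{t=1}^{T} u_tu_t'. \qquad [18.A.2]$$

But from [18.1.7], $\xi_T = T\cdot X_T(1)$. Hence, from [18.1.8] and the continuous mapping theorem,

$$T^{-1}\xi_T\xi_T' = [\sqrt{T}\cdot X_T(1)][\sqrt{T}\cdot X_T(1)]' \xrightarrow{L} \Lambda\cdot[W(1)]\cdot[W(1)]'\cdot\Lambda'. \qquad [18.A.3]$$

Substituting this along with result (c) into [18.A.2] produces

$$T^{-1}\sum_{t=1}^{T} (\xi_{t-1}u_t' + u_t\xi_{t-1}') \xrightarrow{L} \Lambda\cdot[W(1)]\cdot[W(1)]'\cdot\Lambda' - \Gamma_0, \qquad [18.A.4]$$

which establishes result (d) for $s = 0$.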
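
Because [18.A.1] is a purely algebraic identity (with $\xi_0 = 0$), it holds for any sequence $\{u_t\}$ and can be verified numerically. The following sketch is an illustrative check rather than part of the proof; the dimensions, the seed, and the use of Gaussian draws for $u_t$ are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
T, n = 200, 3                                   # illustrative sample size and dimension
u = rng.standard_normal((T, n))                 # row t-1 holds u_t'

# xi_0 = 0 and xi_t = u_1 + ... + u_t; rows of xi are xi_0, ..., xi_T.
xi = np.vstack([np.zeros(n), np.cumsum(u, axis=0)])
xi_lag = xi[:-1]                                # xi_{t-1} for t = 1, ..., T

# Left side of [18.A.1]: sum of (xi_{t-1} u_t' + u_t xi_{t-1}').
lhs = xi_lag.T @ u + u.T @ xi_lag
# Right side of [18.A.1]: xi_T xi_T' minus sum of u_t u_t'.
rhs = np.outer(xi[-1], xi[-1]) - u.T @ u

print(np.max(np.abs(lhs - rhs)))                # difference is at rounding-error level
```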
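For $s > 0$, we have

$$\begin{aligned} T^{-1}&\sum_{t=s+1}^{T} (\xi_{t-1}u_{t-s}' + u_{t-s}\xi_{t-1}') \\ &= T^{-1}\sum_{t=s+1}^{T} \big[(\xi_{t-s-1} + u_{t-s} + u_{t-s+1} + \cdots + u_{t-1})u_{t-s}' \\ &\qquad\qquad + u_{t-s}(\xi_{t-s-1}' + u_{t-s}' + u_{t-s+1}' + \cdots + u_{t-1}')\big] \\ &= T^{-1}\sum_{t=s+1}^{T} (\xi_{t-s-1}u_{t-s}' + u_{t-s}\xi_{t-s-1}') \\ &\quad + T^{-1}\sum_{t=s+1}^{T} \big[(u_{t-s}u_{t-s}') + (u_{t-s+1}u_{t-s}') + \cdots + (u_{t-1}u_{t-s}') \\ &\qquad\qquad + (u_{t-s}u_{t-s}') + (u_{t-s}u_{t-s+1}') + \cdots + (u_{t-s}u_{t-1}')\big] \\ &\xrightarrow{L} \Lambda\cdot[W(1)]\cdot[W(1)]'\cdot\Lambda' - \Gamma_0 \\ &\quad + [\Gamma_0 + \Gamma_1 + \cdots + \Gamma_{s-1} + \Gamma_0 + \Gamma_{-1} + \cdots + \Gamma_{-s+1}], \end{aligned}$$

by virtue of [18.A.4] and result (c).

(e) See Phillips (1988).

(f) Define $\xi_t^* \equiv \varepsilon_1 + \varepsilon_2 + \cdots + \varepsilon_t$ and $E(\varepsilon_t\varepsilon_t') = PP'$. Notice that result (e)
implies that

$$T^{-1}\sum_{t=1}^{T} \xi_{t-1}^*\varepsilon_t' \xrightarrow{L} P\cdot\left\{\int_0^1 [W(r)]\,[dW(r)]'\right\}\cdot P'. \qquad [18.A.5]$$

For $\xi_t \equiv u_1 + u_2 + \cdots + u_t$, equation [18.1.6] establishes that

$$\begin{aligned} T^{-1}\sum_{t=1}^{T} \xi_{t-1}\varepsilon_t' &= T^{-1}\sum_{t=1}^{T} \{\Psi(1)\cdot\xi_{t-1}^* + \eta_{t-1} - \eta_0\}\cdot\varepsilon_t' \\ &= \Psi(1)\cdot T^{-1}\sum_{t=1}^{T} \xi_{t-1}^*\varepsilon_t' + T^{-1}\sum_{t=1}^{T} (\eta_{t-1} - \eta_0)\cdot\varepsilon_t'. \end{aligned} \qquad [18.A.6]$$

But each column of $\{(\eta_{t-1} - \eta_0)\cdot\varepsilon_t'\}_{t=1}^{\infty}$ is a martingale difference sequence with finite
variance, and so, from Example 7.11 of Chapter 7,

$$T^{-1}\sum_{t=1}^{T} (\eta_{t-1} - \eta_0)\cdot\varepsilon_t' \xrightarrow{p} 0. \qquad [18.A.7]$$

Substituting [18.A.5] and [18.A.7] into [18.A.6] produces

$$T^{-1}\sum_{t=1}^{T} \xi_{t-1}\varepsilon_t' \xrightarrow{L} \Psi(1)\cdot P\cdot\left\{\int_0^1 [W(r)]\,[dW(r)]'\right\}\cdot P',$$

as claimed.

(g) This was shown in [18.1.9].

(h) As in [17.3.17], we have

$$T^{-3/2}\sum_{t=1}^{T} \xi_{t-1} = T^{-1/2}\sum_{t=1}^{T} u_t - T^{-3/2}\sum_{t=1}^{T} tu_t,$$

or

$$T^{-3/2}\sum_{t=1}^{T} tu_t = T^{-1/2}\sum_{t=1}^{T} u_t - T^{-3/2}\sum_{t=1}^{T} \xi_{t-1} \xrightarrow{L} \Lambda\cdot W(1) - \Lambda\cdot\int_0^1 W(r)\,dr, \qquad [18.A.8]$$

from results (a) and (g). This establishes result (h) for $s = 0$. The asymptotic distribution
is the same for any $s$, from simple adaptation of the proof of Proposition 17.3(g).

(i) As in [17.3.2],

$$T^{-2}\sum_{t=1}^{T} \xi_{t-1}\xi_{t-1}' = \int_0^1 [\sqrt{T}\cdot X_T(r)]\cdot[\sqrt{T}\cdot X_T(r)]'\,dr \xrightarrow{L} \Lambda\cdot\left\{\int_0^1 [W(r)]\cdot[W(r)]'\,dr\right\}\cdot\Lambda'.$$

(j), (k), and (l) parallel Proposition 17.3(i), (j), and (k). ■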


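■ Proof of Proposition 18.2. The asymptotic distributions are easier to calculate if we work
with the following transformed variables:

$$y_{1t}^* \equiv y_{1t} - \Sigma_{21}'\Sigma_{22}^{-1}y_{2t} \qquad [18.A.9]$$
$$y_{2t}^* \equiv L_{22}'y_{2t}. \qquad [18.A.10]$$

Note that the inverses $\Sigma_{22}^{-1}$, $(\sigma_1^*)^{-1}$, and $L_{22}^{-1}$ all exist, since $\Lambda\Lambda'$ is symmetric positive
definite. An OLS regression of $y_{1t}^*$ on a constant and $y_{2t}^*$,

$$y_{1t}^* = \alpha^* + \gamma^{*\prime}y_{2t}^* + u_t^*, \qquad [18.A.11]$$

would yield estimates

$$\begin{bmatrix} \hat{\alpha}_T^* \\ \hat{\gamma}_T^* \end{bmatrix} = \begin{bmatrix} T & \Sigma y_{2t}^{*\prime} \\ \Sigma y_{2t}^* & \Sigma y_{2t}^*y_{2t}^{*\prime} \end{bmatrix}^{-1} \begin{bmatrix} \Sigma y_{1t}^* \\ \Sigma y_{2t}^*y_{1t}^* \end{bmatrix}. \qquad [18.A.12]$$

Clearly, the residuals from OLS estimation of [18.A.11] are identical to those from OLS
estimation of [18.3.1]:

$$\begin{aligned} y_{1t} - \hat{\alpha}_T - \hat{\gamma}_T'y_{2t} &= y_{1t}^* - \hat{\alpha}_T^* - \hat{\gamma}_T^{*\prime}y_{2t}^* \\ &= (y_{1t} - \Sigma_{21}'\Sigma_{22}^{-1}y_{2t}) - \hat{\alpha}_T^* - \hat{\gamma}_T^{*\prime}(L_{22}'y_{2t}) \\ &= y_{1t} - \hat{\alpha}_T^* - \{\hat{\gamma}_T^{*\prime}L_{22}' + \Sigma_{21}'\Sigma_{22}^{-1}\}y_{2t}. \end{aligned}$$

The OLS estimates for the transformed regression [18.A.11] are thus related to those of
the original regression [18.3.1] by

$$\hat{\alpha}_T = \hat{\alpha}_T^*$$
$$\hat{\gamma}_T = L_{22}\hat{\gamma}_T^* + \Sigma_{22}^{-1}\Sigma_{21}, \qquad [18.A.13]$$

implying that

$$\begin{aligned} \hat{\gamma}_T^* &= L_{22}^{-1}\hat{\gamma}_T - L_{22}^{-1}\Sigma_{22}^{-1}\Sigma_{21} \\ &= L_{22}^{-1}\hat{\gamma}_T - L_{22}^{-1}(L_{22}L_{22}')\Sigma_{21} \\ &= L_{22}^{-1}\hat{\gamma}_T - L_{22}'\Sigma_{21}. \end{aligned} \qquad [18.A.14]$$

The usefulness of this transformation is as follows. Notice that

$$\begin{bmatrix} y_{1t}^*/\sigma_1^* \\ y_{2t}^* \end{bmatrix} = \begin{bmatrix} (1/\sigma_1^*) & (-1/\sigma_1^*)\cdot\Sigma_{21}'\Sigma_{22}^{-1} \\ 0 & L_{22}' \end{bmatrix} \begin{bmatrix} y_{1t} \\ y_{2t} \end{bmatrix} \equiv L'y_t$$

for

$$L' \equiv \begin{bmatrix} (1/\sigma_1^*) & (-1/\sigma_1^*)\cdot\Sigma_{21}'\Sigma_{22}^{-1} \\ 0 & L_{22}' \end{bmatrix}.$$

Moreover,

$$\begin{aligned} L'\Lambda\Lambda'L &= \begin{bmatrix} (1/\sigma_1^*) & (-1/\sigma_1^*)\cdot\Sigma_{21}'\Sigma_{22}^{-1} \\ 0 & L_{22}' \end{bmatrix} \begin{bmatrix} \Sigma_{11} & \Sigma_{21}' \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \begin{bmatrix} (1/\sigma_1^*) & 0' \\ (-1/\sigma_1^*)\cdot\Sigma_{22}^{-1}\Sigma_{21} & L_{22} \end{bmatrix} \\ &= \begin{bmatrix} (\Sigma_{11} - \Sigma_{21}'\Sigma_{22}^{-1}\Sigma_{21})/\sigma_1^* & 0' \\ L_{22}'\Sigma_{21} & L_{22}'\Sigma_{22} \end{bmatrix} \begin{bmatrix} (1/\sigma_1^*) & 0' \\ (-1/\sigma_1^*)\cdot\Sigma_{22}^{-1}\Sigma_{21} & L_{22} \end{bmatrix} \\ &= \begin{bmatrix} (\Sigma_{11} - \Sigma_{21}'\Sigma_{22}^{-1}\Sigma_{21})/(\sigma_1^*)^2 & 0' \\ 0 & L_{22}'\Sigma_{22}L_{22} \end{bmatrix}. \end{aligned} \qquad [18.A.15]$$

But [18.3.7] implies that

$$\Sigma_{22} = (L_{22}L_{22}')^{-1} = (L_{22}')^{-1}L_{22}^{-1},$$

from which

$$L_{22}'\Sigma_{22}L_{22} = L_{22}'\{(L_{22}')^{-1}L_{22}^{-1}\}L_{22} = I_g.$$

Substituting this and [18.3.6] into [18.A.15] results in

$$L'\Lambda\Lambda'L = I_n. \qquad [18.A.16]$$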
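
Result [18.A.16] can be checked numerically for an arbitrary nonsingular $\Lambda$. The sketch below is an illustrative addition; the dimension $n = 4$, the seed, and the randomly drawn stand-in for $\Lambda = \Psi(1)\cdot P$ are assumptions made only for this check. It builds $L'$ from [18.3.5] through [18.3.7] and confirms that $L'\Lambda\Lambda'L$ is the identity matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4                                           # illustrative dimension, so g = n - 1 = 3
Lam = rng.standard_normal((n, n))               # stand-in for Lambda = Psi(1) P (nonsingular a.s.)
S = Lam @ Lam.T                                 # Lambda Lambda', partitioned as in [18.3.5]

S11 = S[0, 0]
S21 = S[1:, [0]]                                # (g x 1) block
S22 = S[1:, 1:]                                 # (g x g) block
S22_inv = np.linalg.inv(S22)
S22_inv = (S22_inv + S22_inv.T) / 2             # remove rounding asymmetry before factoring

sigma1_star = np.sqrt(S11 - (S21.T @ S22_inv @ S21).item())   # [18.3.6]
L22 = np.linalg.cholesky(S22_inv)               # lower triangular, S22_inv = L22 L22' as in [18.3.7]

# L' as defined in the proof of Proposition 18.2.
L_prime = np.zeros((n, n))
L_prime[0, 0] = 1.0 / sigma1_star
L_prime[0, 1:] = (-(1.0 / sigma1_star) * (S21.T @ S22_inv)).ravel()
L_prime[1:, 1:] = L22.T

check = L_prime @ S @ L_prime.T                 # should equal I_n by [18.A.16]
print(np.round(check, 10))
```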


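One of the implications is that if $W(r)$ is $n$-dimensional standard Brownian motion,
then the $n$-dimensional process $W^*(r)$ defined by

$$W^*(r) \equiv L'\Lambda\cdot W(r) \qquad [18.A.17]$$

is Brownian motion with variance matrix $L'\Lambda\Lambda'L = I_n$. In other words, $W^*(r)$ could also
be described as standard Brownian motion. Since result (g) of Proposition 18.1 implies that

$$T^{-3/2}\sum_{t=1}^{T} y_t \xrightarrow{L} \Lambda\int_0^1 W(r)\,dr,$$

it follows that

$$\begin{bmatrix} T^{-3/2}\Sigma y_{1t}^*/\sigma_1^* \\ T^{-3/2}\Sigma y_{2t}^* \end{bmatrix} = T^{-3/2}\Sigma L'y_t \xrightarrow{L} L'\Lambda\int_0^1 W(r)\,dr = \int_0^1 W^*(r)\,dr. \qquad [18.A.18]$$

Similarly, result (i) of Proposition 18.1 gives

$$\begin{aligned} \begin{bmatrix} T^{-2}\Sigma(y_{1t}^*)^2/(\sigma_1^*)^2 & T^{-2}\Sigma y_{1t}^*y_{2t}^{*\prime}/\sigma_1^* \\ T^{-2}\Sigma y_{2t}^*y_{1t}^*/\sigma_1^* & T^{-2}\Sigma y_{2t}^*y_{2t}^{*\prime} \end{bmatrix} &= L'\left(T^{-2}\sum_{t=1}^{T} y_ty_t'\right)L \\ &\xrightarrow{L} L'\Lambda\left\{\int_0^1 [W(r)]\cdot[W(r)]'\,dr\right\}\Lambda'L \\ &= \int_0^1 [W^*(r)]\cdot[W^*(r)]'\,dr. \end{aligned} \qquad [18.A.19]$$

It is now straightforward to prove the claims in Proposition 18.2.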


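Proof of (a). If [18.A.12] is divided by $\sigma_1^*$ and premultiplied by the matrix

$$\begin{bmatrix} T^{-1/2} & 0' \\ 0 & I_g \end{bmatrix},$$

the result is

$$\begin{aligned} \begin{bmatrix} T^{-1/2}\hat{\alpha}_T^*/\sigma_1^* \\ \hat{\gamma}_T^*/\sigma_1^* \end{bmatrix} &= \begin{bmatrix} T^{-1/2} & 0' \\ 0 & I_g \end{bmatrix} \begin{bmatrix} T & \Sigma y_{2t}^{*\prime} \\ \Sigma y_{2t}^* & \Sigma y_{2t}^*y_{2t}^{*\prime} \end{bmatrix}^{-1} \begin{bmatrix} T^{-3/2} & 0' \\ 0 & T^{-2}I_g \end{bmatrix}^{-1} \begin{bmatrix} T^{-3/2} & 0' \\ 0 & T^{-2}I_g \end{bmatrix} \begin{bmatrix} \Sigma y_{1t}^*/\sigma_1^* \\ \Sigma y_{2t}^*y_{1t}^*/\sigma_1^* \end{bmatrix} \\ &= \left( \begin{bmatrix} T^{-3/2} & 0' \\ 0 & T^{-2}I_g \end{bmatrix} \begin{bmatrix} T & \Sigma y_{2t}^{*\prime} \\ \Sigma y_{2t}^* & \Sigma y_{2t}^*y_{2t}^{*\prime} \end{bmatrix} \begin{bmatrix} T^{1/2} & 0' \\ 0 & I_g \end{bmatrix} \right)^{-1} \left( \begin{bmatrix} T^{-3/2} & 0' \\ 0 & T^{-2}I_g \end{bmatrix} \begin{bmatrix} \Sigma y_{1t}^*/\sigma_1^* \\ \Sigma y_{2t}^*y_{1t}^*/\sigma_1^* \end{bmatrix} \right) \end{aligned}$$

or

$$\begin{bmatrix} T^{-1/2}\hat{\alpha}_T^*/\sigma_1^* \\ \hat{\gamma}_T^*/\sigma_1^* \end{bmatrix} = \begin{bmatrix} 1 & T^{-3/2}\Sigma y_{2t}^{*\prime} \\ T^{-3/2}\Sigma y_{2t}^* & T^{-2}\Sigma y_{2t}^*y_{2t}^{*\prime} \end{bmatrix}^{-1} \begin{bmatrix} T^{-3/2}\Sigma y_{1t}^*/\sigma_1^* \\ T^{-2}\Sigma y_{2t}^*y_{1t}^*/\sigma_1^* \end{bmatrix}. \qquad [18.A.20]$$

Partition $W^*(r)$ as

$$W^*(r) = \begin{bmatrix} W_1^*(r) \\ W_2^*(r) \end{bmatrix},$$

where $W_1^*(r)$ is $(1 \times 1)$ and $W_2^*(r)$ is $(g \times 1)$. Applying [18.A.18] and [18.A.19] to [18.A.20] results in

$$\begin{bmatrix} T^{-1/2}\hat{\alpha}_T^*/\sigma_1^* \\ \hat{\gamma}_T^*/\sigma_1^* \end{bmatrix} \xrightarrow{L} \begin{bmatrix} 1 & \int [W_2^*(r)]'\,dr \\ \int W_2^*(r)\,dr & \int [W_2^*(r)]\cdot[W_2^*(r)]'\,dr \end{bmatrix}^{-1} \begin{bmatrix} \int W_1^*(r)\,dr \\ \int W_2^*(r)\cdot W_1^*(r)\,dr \end{bmatrix} \equiv \begin{bmatrix} h_1 \\ h_2 \end{bmatrix}. \qquad [18.A.21]$$

Recalling the relation between the transformed estimates and the original estimates given
in [18.A.14], this establishes that

$$\begin{bmatrix} T^{-1/2}\hat{\alpha}_T/\sigma_1^* \\ (1/\sigma_1^*)\cdot[L_{22}^{-1}\hat{\gamma}_T - L_{22}'\Sigma_{21}] \end{bmatrix} \xrightarrow{L} \begin{bmatrix} h_1 \\ h_2 \end{bmatrix}.$$

Premultiplying by

$$\begin{bmatrix} \sigma_1^* & 0' \\ 0 & \sigma_1^*L_{22} \end{bmatrix}$$

and recalling [18.3.7] produces [18.3.8].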


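Proof of (b). Again we exploit the fact that OLS estimation of [18.A.11] would produce
the identical residuals that would result from OLS estimation of [18.3.1]. Recall the
expression for the residual sum of squares in [4.A.6]:

$$\begin{aligned} RSS_T &= \Sigma(y_{1t}^*)^2 - \begin{bmatrix} \Sigma y_{1t}^* & \Sigma y_{1t}^*y_{2t}^{*\prime} \end{bmatrix} \begin{bmatrix} T & \Sigma y_{2t}^{*\prime} \\ \Sigma y_{2t}^* & \Sigma y_{2t}^*y_{2t}^{*\prime} \end{bmatrix}^{-1} \begin{bmatrix} \Sigma y_{1t}^* \\ \Sigma y_{2t}^*y_{1t}^* \end{bmatrix} \\ &= \Sigma(y_{1t}^*)^2 - \begin{bmatrix} \Sigma y_{1t}^* & \Sigma y_{1t}^*y_{2t}^{*\prime} \end{bmatrix} \begin{bmatrix} T^{-1/2} & 0' \\ 0 & T^{-1}I_g \end{bmatrix} \\ &\quad \times \left\{ \begin{bmatrix} T^{-1/2} & 0' \\ 0 & T^{-1}I_g \end{bmatrix} \begin{bmatrix} T & \Sigma y_{2t}^{*\prime} \\ \Sigma y_{2t}^* & \Sigma y_{2t}^*y_{2t}^{*\prime} \end{bmatrix} \begin{bmatrix} T^{-1/2} & 0' \\ 0 & T^{-1}I_g \end{bmatrix} \right\}^{-1} \begin{bmatrix} T^{-1/2} & 0' \\ 0 & T^{-1}I_g \end{bmatrix} \begin{bmatrix} \Sigma y_{1t}^* \\ \Sigma y_{2t}^*y_{1t}^* \end{bmatrix}. \end{aligned} \qquad [18.A.22]$$

If both sides of [18.A.22] are divided by $(T\cdot\sigma_1^*)^2$, the result is

$$\begin{aligned} T^{-2}\cdot RSS_T/(\sigma_1^*)^2 &= T^{-2}\Sigma(y_{1t}^*/\sigma_1^*)^2 - \begin{bmatrix} T^{-3/2}\Sigma(y_{1t}^*/\sigma_1^*) & T^{-2}\Sigma(y_{1t}^*/\sigma_1^*)y_{2t}^{*\prime} \end{bmatrix} \\ &\quad \times \begin{bmatrix} 1 & T^{-3/2}\Sigma y_{2t}^{*\prime} \\ T^{-3/2}\Sigma y_{2t}^* & T^{-2}\Sigma y_{2t}^*y_{2t}^{*\prime} \end{bmatrix}^{-1} \begin{bmatrix} T^{-3/2}\Sigma y_{1t}^*/\sigma_1^* \\ T^{-2}\Sigma y_{2t}^*y_{1t}^*/\sigma_1^* \end{bmatrix} \\ &\xrightarrow{L} \int [W_1^*(r)]^2\,dr - \begin{bmatrix} \int W_1^*(r)\,dr & \int [W_1^*(r)]\cdot[W_2^*(r)]'\,dr \end{bmatrix} \\ &\quad \times \begin{bmatrix} 1 & \int [W_2^*(r)]'\,dr \\ \int W_2^*(r)\,dr & \int [W_2^*(r)]\cdot[W_2^*(r)]'\,dr \end{bmatrix}^{-1} \begin{bmatrix} \int W_1^*(r)\,dr \\ \int W_2^*(r)\cdot W_1^*(r)\,dr \end{bmatrix}. \end{aligned}$$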

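Proof of (c). Note that an $F$ test of the hypothesis $H_0$: $R\gamma = r$ for the original regression
[18.3.1] would produce exactly the same value as an $F$ test of $R^*\gamma^* = r^*$ for OLS estimation
of [18.A.11], where, from [18.A.13],

$$R\gamma - r = R\{L_{22}\gamma^* + \Sigma_{22}^{-1}\Sigma_{21}\} - r = R^*\gamma^* - r^*$$

for

$$R^* \equiv R\cdot L_{22} \qquad [18.A.23]$$
$$r^* \equiv r - R\Sigma_{22}^{-1}\Sigma_{21}. \qquad [18.A.24]$$

The OLS $F$ test of $R^*\gamma^* = r^*$ is given by

$$F_T = \{R^*\hat{\gamma}_T^* - r^*\}' \left\{ [s_T^*]^2 \begin{bmatrix} 0 & R^* \end{bmatrix} \begin{bmatrix} T & \Sigma y_{2t}^{*\prime} \\ \Sigma y_{2t}^* & \Sigma y_{2t}^*y_{2t}^{*\prime} \end{bmatrix}^{-1} \begin{bmatrix} 0' \\ R^{*\prime} \end{bmatrix} \right\}^{-1} \{R^*\hat{\gamma}_T^* - r^*\} \div m,$$

from which

$$T^{-1}\cdot F_T = \{R^*\hat{\gamma}_T^* - r^*\}' \left\{ T^{-1}[s_T^*]^2 \begin{bmatrix} 0 & R^* \end{bmatrix} \begin{bmatrix} 1 & T^{-3/2}\Sigma y_{2t}^{*\prime} \\ T^{-3/2}\Sigma y_{2t}^* & T^{-2}\Sigma y_{2t}^*y_{2t}^{*\prime} \end{bmatrix}^{-1} \begin{bmatrix} 0' \\ R^{*\prime} \end{bmatrix} \right\}^{-1} \{R^*\hat{\gamma}_T^* - r^*\} \div m. \qquad [18.A.25]$$

But

$$[s_T^*]^2 = (T - n)^{-1}\sum_{t=1}^{T} (\hat{u}_t^*)^2 = (T - n)^{-1}\cdot RSS_T,$$

and so, from result (b),

$$T^{-1}\cdot[s_T^*]^2 = [T/(T - n)]\cdot T^{-2}\cdot RSS_T \xrightarrow{L} (\sigma_1^*)^2\cdot H. \qquad [18.A.26]$$

Moreover, [18.A.18] and [18.A.19] imply that

$$\begin{bmatrix} 1 & T^{-3/2}\Sigma y_{2t}^{*\prime} \\ T^{-3/2}\Sigma y_{2t}^* & T^{-2}\Sigma y_{2t}^*y_{2t}^{*\prime} \end{bmatrix} \xrightarrow{L} \begin{bmatrix} 1 & \int [W_2^*(r)]'\,dr \\ \int W_2^*(r)\,dr & \int [W_2^*(r)]\cdot[W_2^*(r)]'\,dr \end{bmatrix}, \qquad [18.A.27]$$

while from [18.A.21],

$$\hat{\gamma}_T^* \xrightarrow{L} \sigma_1^*\cdot h_2. \qquad [18.A.28]$$

Substituting [18.A.26] through [18.A.28] into [18.A.25], we conclude that

$$T^{-1}\cdot F_T \xrightarrow{L} \{\sigma_1^*\cdot R^*h_2 - r^*\}' \left\{ (\sigma_1^*)^2\cdot H \begin{bmatrix} 0 & R^* \end{bmatrix} \begin{bmatrix} 1 & \int [W_2^*(r)]'\,dr \\ \int W_2^*(r)\,dr & \int [W_2^*(r)]\cdot[W_2^*(r)]'\,dr \end{bmatrix}^{-1} \begin{bmatrix} 0' \\ R^{*\prime} \end{bmatrix} \right\}^{-1} \{\sigma_1^*\cdot R^*h_2 - r^*\} \div m. \quad ■$$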

Chapter 18 Exercises 


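18.1. Consider OLS estimation of

$$y_{it} = \zeta_{i1}'\Delta y_{t-1} + \zeta_{i2}'\Delta y_{t-2} + \cdots + \zeta_{i,p-1}'\Delta y_{t-p+1} + \alpha_i + \rho_i'y_{t-1} + \varepsilon_{it},$$

where $y_{it}$ is the $i$th element of the $(n \times 1)$ vector $y_t$ and $\varepsilon_{it}$ is the $i$th element of the
$(n \times 1)$ vector $\varepsilon_t$. Assume that $\varepsilon_t$ is i.i.d. with mean zero, positive definite variance $\Omega$, and
finite fourth moments and that $\Delta y_t = \Psi(L)\varepsilon_t$, where the sequence of $(n \times n)$ matrices
$\{s\cdot\Psi_s\}_{s=0}^{\infty}$ is absolutely summable and $\Psi(1)$ is nonsingular. Let $k = np + 1$ denote the number
of regressors, and define

$$x_t \equiv (\Delta y_{t-1}', \Delta y_{t-2}', \ldots, \Delta y_{t-p+1}', 1, y_{t-1}')'.$$

Let $b_T$ denote the $(k \times 1)$ vector of estimated coefficients:

$$b_T = (\Sigma x_tx_t')^{-1}(\Sigma x_ty_{it}),$$

where $\Sigma$ denotes summation over $t$ from 1 to $T$. Consider any null hypothesis $H_0$: $R\beta = r$
that involves only the coefficients on $\Delta y_{t-s}$; that is, $R$ is of the form

$$\underset{(m \times k)}{R} = \begin{bmatrix} \underset{[m \times n(p-1)]}{R_1} & \underset{[m \times (1+n)]}{0} \end{bmatrix}.$$

Let $\chi_T^2$ be the Wald form of the OLS $\chi^2$ test of $H_0$:

$$\chi_T^2 = (Rb_T - r)'[s_T^2R(\Sigma x_tx_t')^{-1}R']^{-1}(Rb_T - r),$$

where

$$s_T^2 = (T - k)^{-1}\Sigma(y_{it} - b_T'x_t)^2.$$

Under the maintained hypothesis that $\alpha_i = 0$ and $\rho_i = e_i$ (where $e_i'$ denotes the $i$th row
of $I_n$), show that $\chi_T^2 \xrightarrow{L} \chi^2(m)$.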
18.2. Suppose that the regression model

$$y_{it} = \zeta_{i1}'\Delta y_{t-1} + \zeta_{i2}'\Delta y_{t-2} + \cdots + \zeta_{i,p-1}'\Delta y_{t-p+1} + \alpha_i + \rho_i'y_{t-1} + \varepsilon_{it}$$

satisfies the conditions of Exercise 18.1. Partition this regression as in [18.2.37]:
Yn = Bay + VAYa-1 + BrAyi-2 + YW2AY-2 + °° 
+ By-AYs-pet + Yp-tAY2u-pat + + OWYi-1 
+ B’ys,-1t ens 


where $y_{1t}$ is an $(n_1 \times 1)$ vector and $y_{2t}$ is an $(n_2 \times 1)$ vector with $n_1 + n_2 = n$. Consider
the null hypothesis $\gamma_1 = \gamma_2 = \cdots = \gamma_{p-1} = \delta = 0$. Describe the asymptotic distribution
of the Wald form of the OLS $\chi^2$ test of this null hypothesis.


18.3. Consider OLS estimation of

$$y_{1t} = \gamma\Delta y_{2t} + \alpha + \phi y_{1,t-1} + \eta y_{2,t-1} + u_t,$$

where $y_{1t}$ and $y_{2t}$ are independent random walks as specified in [18.3.13] and [18.3.14]. Note
that the fitted values of this regression are identical to those for [18.3.17] with $\hat{\alpha}_T$, $\hat{\gamma}_T$, and
$\hat{\phi}_T$ the same for both regressions and $\hat{\delta}_T = \hat{\eta}_T - \hat{\gamma}_T$.

(a) Show that

$$\begin{bmatrix} T^{1/2}\hat{\gamma}_T \\ T^{1/2}\hat{\alpha}_T \\ T(\hat{\phi}_T - 1) \\ T\hat{\eta}_T \end{bmatrix} \xrightarrow{L} \begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix},$$

where $v_1 \sim N(0, \sigma_1^2/\sigma_2^2)$ and $(v_2, v_3, v_4)'$ has a nonstandard limiting distribution. Conclude that
$\hat{\gamma}_T$, $\hat{\alpha}_T$, $\hat{\phi}_T$, and $\hat{\eta}_T$ are consistent estimates of 0, 0, 1, and 0, respectively, meaning that all
of the estimated coefficients in [18.3.17] are consistent.

(b) Show that the $t$ test of the null hypothesis that $\gamma = 0$ is asymptotically
$N(0, 1)$.

(c) Show that the $t$ test of the null hypothesis that $\delta = 0$ in the regression model of
[18.3.17] is also asymptotically $N(0, 1)$.


Chapter 18 References 


Blough, Stephen R. 1992. “Spurious Regressions, with AR(1) Correction and Unit Root
Pretest.” Johns Hopkins University. Mimeo.

Chan, N. H., and C. Z. Wei. 1988. “Limiting Distributions of Least Squares Estimates of
Unstable Autoregressive Processes.” Annals of Statistics 16:367-401.

Granger, C. W. J., and Paul Newbold. 1974. “Spurious Regressions in Econometrics.”
Journal of Econometrics 2:111-20.

Ohanian, Lee E. 1988. “The Spurious Effects of Unit Roots on Vector Autoregressions: A
Monte Carlo Study.” Journal of Econometrics 39:251-66.

Park, Joon Y., and Peter C. B. Phillips. 1988. “Statistical Inference in Regressions with
Integrated Processes: Part 1.” Econometric Theory 4:468-97.

Park, Joon Y., and Peter C. B. Phillips. 1989. “Statistical Inference in Regressions with
Integrated Processes: Part 2.” Econometric Theory 5:95-131.

Phillips, Peter C. B. 1986. “Understanding Spurious Regressions in Econometrics.” Journal
of Econometrics 33:311-40.

Phillips, Peter C. B. 1988. “Weak Convergence of Sample Covariance Matrices to Stochastic
Integrals via Martingale Approximations.” Econometric Theory 4:528-33.

Phillips, Peter C. B., and S. N. Durlauf. 1986. “Multiple Time Series Regression with
Integrated Processes.” Review of Economic Studies 53:473-95.

Phillips, Peter C. B., and Victor Solo. 1992. “Asymptotics for Linear Processes.” Annals of
Statistics 20:971-1001.

Sims, Christopher A., James H. Stock, and Mark W. Watson. 1990. “Inference in Linear
Time Series Models with Some Unit Roots.” Econometrica 58:113-44.

Toda, H. Y., and P. C. B. Phillips. 1993a. “The Spurious Effect of Unit Roots on Exogeneity
Tests in Vector Autoregressions: An Analytical Study.” Journal of Econometrics 59:229-55.

Toda, H. Y., and P. C. B. Phillips. 1993b. “Vector Autoregressions and Causality.”
Econometrica, forthcoming.

West, Kenneth D. 1988. “Asymptotic Normality, When Regressors Have a Unit Root.”
Econometrica 56:1397-1417.




19 


Cointegration 


This chapter discusses a particular class of vector unit root processes known as
cointegrated processes. Such specifications were implicit in the “error-correction”
models advocated by Davidson, Hendry, Srba, and Yeo (1978). However, a formal
development of the key concepts did not come until the work of Granger (1983)
and Engle and Granger (1987).

Section 19.1 introduces the concept of cointegration and develops several
alternative representations of a cointegrated system. Section 19.2 discusses tests of
whether a vector process is cointegrated. These tests are summarized in Table 19.1.
Single-equation methods for estimating a cointegrating vector and testing a
hypothesis about its value are presented in Section 19.3. Full-information maximum
likelihood estimation is discussed in Chapter 20.


19.1. Introduction 


Description of Cointegration 


An $(n \times 1)$ vector time series $y_t$ is said to be cointegrated if each of the series
taken individually is I(1), that is, nonstationary with a unit root, while some linear
combination of the series $a'y_t$ is stationary, or I(0), for some nonzero $(n \times 1)$
vector $a$. A simple example of a cointegrated vector process is the following
bivariate system:

$$y_{1t} = \gamma y_{2t} + u_{1t} \qquad [19.1.1]$$
$$y_{2t} = y_{2,t-1} + u_{2t}, \qquad [19.1.2]$$

with $u_{1t}$ and $u_{2t}$ uncorrelated white noise processes. The univariate representation
for $y_{2t}$ is a random walk,

$$\Delta y_{2t} = u_{2t}, \qquad [19.1.3]$$

while differencing [19.1.1] results in

$$\Delta y_{1t} = \gamma\Delta y_{2t} + \Delta u_{1t} = \gamma u_{2t} + u_{1t} - u_{1,t-1}. \qquad [19.1.4]$$

Recall from Section 4.7 that the right side of [19.1.4] has an MA(1) representation:

$$\Delta y_{1t} = v_t + \theta v_{t-1}, \qquad [19.1.5]$$

where $v_t$ is a white noise process and $\theta \neq -1$ as long as $\gamma \neq 0$ and $E(u_{2t}^2) > 0$.
Thus, both $y_{1t}$ and $y_{2t}$ are I(1) processes, though the linear combination
$(y_{1t} - \gamma y_{2t})$ is stationary. Hence, we would say that $y_t = (y_{1t}, y_{2t})'$ is cointegrated
with $a' = (1, -\gamma)$.

Figure 19.1 plots a sample realization of [19.1.1] and [19.1.2] for $\gamma = 1$ and
$u_{1t}$ and $u_{2t}$ independent $N(0, 1)$ variables. Note that either series ($y_{1t}$ or $y_{2t}$) will
wander arbitrarily far from the starting value, though $y_{1t}$ should remain within a
fixed distance of $\gamma y_{2t}$, with this distance determined by the standard deviation of
$u_{1t}$.
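
A realization of this kind is straightforward to generate. The sketch below is an illustrative addition, not the code behind Figure 19.1; the sample size and seed are arbitrary assumptions, while $\gamma = 1$ and the $N(0, 1)$ disturbances follow the description in the text.

```python
import numpy as np

rng = np.random.default_rng(3)
T, gamma = 1000, 1.0                 # illustrative sample size; gamma = 1 as in Figure 19.1
u1 = rng.standard_normal(T)          # u_{1t} ~ N(0, 1)
u2 = rng.standard_normal(T)          # u_{2t} ~ N(0, 1), independent of u_{1t}

y2 = np.cumsum(u2)                   # random walk [19.1.2]
y1 = gamma * y2 + u1                 # [19.1.1]

z = y1 - gamma * y2                  # cointegrating combination a'y_t with a' = (1, -gamma)

print("range of y2:", np.ptp(y2))    # the level wanders widely
print("range of z: ", np.ptp(z))     # the combination stays within a band set by sd(u1)
print("variance of z:", np.var(z))   # close to Var(u1) = 1
```

By construction `z` reproduces $u_{1t}$, so its dispersion is governed by the standard deviation of $u_{1t}$ no matter how far the individual series drift.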

Cointegration means that although many developments can cause permanent
changes in the individual elements of $y_t$, there is some long-run equilibrium relation
tying the individual components together, represented by the linear combination
$a'y_t$. An example of such a system is the model of consumption spending proposed
by Davidson, Hendry, Srba, and Yeo (1978). Their results suggest that although
both consumption and income exhibit a unit root, over the long run consumption
tends to be a roughly constant proportion of income, so that the difference between
the log of consumption and the log of income appears to be a stationary process.

Another example of an economic hypothesis that lends itself naturally to a
cointegration interpretation is the theory of purchasing power parity. This theory
holds that, apart from transportation costs, goods should sell for the same effective
price in two countries. Let $P_t$ denote an index of the price level in the United States
(in dollars per good), $P_t^*$ a price index for Italy (in lire per good), and $S_t$ the rate
of exchange between the currencies (in dollars per lira). Then purchasing power
parity holds that

$$P_t = S_tP_t^*,$$

or, taking logarithms,

$$p_t = s_t + p_t^*,$$

where $p_t \equiv \log P_t$, $s_t \equiv \log S_t$, and $p_t^* \equiv \log P_t^*$. In practice, errors in measuring
prices, transportation costs, and differences in quality prevent purchasing power
parity from holding exactly at every date $t$. A weaker version of the hypothesis is
that the variable $z_t$ defined by

$$z_t \equiv p_t - s_t - p_t^* \qquad [19.1.6]$$

is stationary, even though the individual elements ($p_t$, $s_t$, or $p_t^*$) are all I(1).
Empirical tests of this version of the purchasing power parity hypothesis have been
explored by Baillie and Selover (1987) and Corbae and Ouliaris (1988).

FIGURE 19.1 Sample realization of cointegrated series.

Many other interesting applications of the idea of cointegration have been
investigated. Kremers (1989) suggested that governments are forced politically to
maintain their debt at a roughly constant multiple of GNP, so that log(debt) −
log(GNP) is stationary even though each component individually is not. Campbell
and Shiller (1988a, b) noted that if $y_{2t}$ is I(1) and $y_{1t}$ is a rational forecast of future
values of $y_2$, then $y_1$ and $y_2$ will be cointegrated. Other interesting applications
include King, Plosser, Stock, and Watson (1991), Ogaki (1992), Ogaki and Park
(1992), and Clarida (1991).

It was asserted in the previous chapter that if y, is cointegrated, then it is not 
correct to fit a vector autoregression to the differenced data. We now verify this 
claim for the particular example of [19.1.1] and [19.1.2]. The issues will then be 
discussed in terms of a general cointegrated system involving n different variables. 


Discussion of the Example of [19.1.1] and [19.1.2] 


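Returning to the example in [19.1.1] and [19.1.2], notice that $\varepsilon_{2t} = u_{2t}$ is the
error in forecasting $y_{2t}$ on the basis of lagged values of $y_1$ and $y_2$ while $\varepsilon_{1t} =
\gamma u_{2t} + u_{1t}$ is the error in forecasting $y_{1t}$. The right side of [19.1.4] can be written

$$(\gamma u_{2t} + u_{1t}) - u_{1,t-1} = \varepsilon_{1t} - (\varepsilon_{1,t-1} - \gamma\varepsilon_{2,t-1}) = (1 - L)\varepsilon_{1t} + \gamma L\varepsilon_{2t}.$$

Substituting this into [19.1.4] and stacking it in a vector system along with [19.1.3]
produces the vector moving average representation for $(\Delta y_{1t}, \Delta y_{2t})'$,

$$\begin{bmatrix} \Delta y_{1t} \\ \Delta y_{2t} \end{bmatrix} = \Psi(L)\begin{bmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{bmatrix}, \qquad [19.1.7]$$

where

$$\Psi(L) \equiv \begin{bmatrix} 1 - L & \gamma L \\ 0 & 1 \end{bmatrix}. \qquad [19.1.8]$$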
A VAR for the differenced data, if it existed, would take the form

$$\Phi(L)\Delta y_t = \varepsilon_t,$$

where $\Phi(L) = [\Psi(L)]^{-1}$. But the matrix polynomial associated with the moving
average operator for this process, $\Psi(z)$, has a root at unity,

$$|\Psi(1)| = \begin{vmatrix} (1 - 1) & \gamma \\ 0 & 1 \end{vmatrix} = 0.$$

Hence the matrix moving average operator is noninvertible, and no finite-order
vector autoregression could describe $\Delta y_t$.

The reason a finite-order VAR in differences affords a poor approximation
to the cointegrated system of [19.1.1] and [19.1.2] is that the level of $y_2$ contains
information that is useful for forecasting $y_1$ beyond that contained in a finite number
of lagged changes in $y_2$ alone.

If we are willing to modify the VAR by including lagged levels along with
lagged changes, a stationary representation similar to a VAR for $\Delta y_t$ is easy to find.
Recalling that $u_{1,t-1} = y_{1,t-1} - \gamma y_{2,t-1}$, notice that [19.1.4] and [19.1.3] can be
written as

$$\begin{bmatrix} \Delta y_{1t} \\ \Delta y_{2t} \end{bmatrix} = \begin{bmatrix} -1 & \gamma \\ 0 & 0 \end{bmatrix} \begin{bmatrix} y_{1,t-1} \\ y_{2,t-1} \end{bmatrix} + \begin{bmatrix} \gamma u_{2t} + u_{1t} \\ u_{2t} \end{bmatrix}. \qquad [19.1.9]$$


The general principle of which [19.1.9] provides an illustration is that with a 
cointegrated system, one should include lagged levels along with lagged differences 
in a vector autoregression explaining Ay,. The lagged levels will appear in the form 
of those linear combinations of y that are stationary. 
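
The first row of [19.1.9] can be illustrated by simulation: regressing $\Delta y_{1t}$ on the lagged levels should recover coefficients near $-1$ and $\gamma$. The following sketch is an illustrative addition under assumed settings ($\gamma = 2$, $T = 5000$, standard Normal disturbances, an arbitrary seed); it is not an estimation method proposed in the text.

```python
import numpy as np

rng = np.random.default_rng(4)
T, gamma = 5000, 2.0                      # illustrative sample size and cointegrating coefficient
u1 = rng.standard_normal(T)
u2 = rng.standard_normal(T)

y2 = np.cumsum(u2)                        # [19.1.2]
y1 = gamma * y2 + u1                      # [19.1.1]

dy1 = np.diff(y1)                         # Delta y_{1t} for t = 2, ..., T
X = np.column_stack([y1[:-1], y2[:-1]])   # lagged levels y_{1,t-1} and y_{2,t-1}

# OLS of Delta y_{1t} on the lagged levels; [19.1.9] implies coefficients (-1, gamma).
coef, *_ = np.linalg.lstsq(X, dy1, rcond=None)
print(coef)
```

The estimated coefficients enter only through the stationary combination $y_{1,t-1} - \gamma y_{2,t-1}$, which is the sense in which lagged levels appear as stationary linear combinations.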


General Characterization of the Cointegrating Vector 


Recall that an $(n \times 1)$ vector $y_t$ is said to be cointegrated if each of its elements
individually is I(1) and if there exists a nonzero $(n \times 1)$ vector $a$ such that $a'y_t$ is
stationary. When this is the case, $a$ is called a cointegrating vector.

Clearly, the cointegrating vector a is not unique, for if a’y, is stationary, then 
so is ba'y, for any nonzero scalar b; if a is a cointegrating vector, then so is ba. In 
speaking of the value of the cointegrating vector, an arbitrary normalization must 
be made, such as that the first element of a is unity. 

If there are more than two variables contained in $y_t$, then there may be two
nonzero $(n \times 1)$ vectors $a_1$ and $a_2$ such that $a_1'y_t$ and $a_2'y_t$ are both stationary, where
$a_1$ and $a_2$ are linearly independent (that is, there does not exist a scalar $b$ such that
$a_2 = ba_1$). Indeed, there may be $h < n$ linearly independent $(n \times 1)$ vectors
$(a_1, a_2, \ldots, a_h)$ such that $A'y_t$ is a stationary $(h \times 1)$ vector, where $A'$ is the following
$(h \times n)$ matrix:¹

$$A' \equiv \begin{bmatrix} a_1' \\ a_2' \\ \vdots \\ a_h' \end{bmatrix}. \qquad [19.1.10]$$

Again, the vectors $(a_1, a_2, \ldots, a_h)$ are not unique; if $A'y_t$ is stationary, then for
any nonzero $(1 \times h)$ vector $b'$, the scalar $b'A'y_t$ is also stationary. Then the
$(n \times 1)$ vector $\pi$ given by $\pi' = b'A'$ could also be described as a cointegrating
vector.

Suppose that there exists an $(h \times n)$ matrix $A'$ whose rows are linearly
independent such that $A'y_t$ is a stationary $(h \times 1)$ vector. Suppose further that if
$c'$ is any $(1 \times n)$ vector that is linearly independent of the rows of $A'$, then $c'y_t$ is
a nonstationary scalar. Then we say that there are exactly $h$ cointegrating relations
among the elements of $y_t$ and that $(a_1, a_2, \ldots, a_h)$ form a basis for the space of
cointegrating vectors.


Implications of Cointegration 
for the Vector Moving Average Representation 
We now discuss the general implications of cointegration for the moving 


average and vector autoregressive representations of a vector system.” Since it is 
assumed that Ay, is stationary, let 8 = E(Ay,) and define 


u, = Ay, — 8. [19.1.11] 
Suppose that u, has the Wold representation 
u, = €, + Wie,_, + Wpe,. + +--+ = V(L)e,, 


¹If $h = n$ such linearly independent vectors existed, then $\mathbf{y}_t$ would itself be I(0). This claim will
become apparent in the triangular representation of a cointegrated system developed in [19.1.20] and
[19.1.21].

²These results were first derived by Engle and Granger (1987).




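where $E(\boldsymbol{\varepsilon}_t) = \mathbf{0}$ and

$$E(\boldsymbol{\varepsilon}_t\boldsymbol{\varepsilon}_\tau') = \begin{cases} \boldsymbol{\Omega} & \text{for } t = \tau \\ \mathbf{0} & \text{otherwise.} \end{cases}$$

Let $\boldsymbol{\Psi}(1)$ denote the $(n \times n)$ matrix polynomial $\boldsymbol{\Psi}(z)$ evaluated at $z = 1$; that is,

$$\boldsymbol{\Psi}(1) \equiv \mathbf{I}_n + \boldsymbol{\Psi}_1 + \boldsymbol{\Psi}_2 + \boldsymbol{\Psi}_3 + \cdots.$$

We first claim that if $\mathbf{A}'\mathbf{y}_t$ is stationary, then

$$\mathbf{A}'\boldsymbol{\Psi}(1) = \mathbf{0}. \qquad [19.1.12]$$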


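To verify this claim, note that as long as $\{s\cdot\boldsymbol{\Psi}_s\}_{s=0}^{\infty}$ is absolutely summable, the
difference equation [19.1.11] implies that

$$\begin{aligned}
\mathbf{y}_t &= \mathbf{y}_0 + \boldsymbol{\delta}\cdot t + \mathbf{u}_1 + \mathbf{u}_2 + \cdots + \mathbf{u}_t \\
&= \mathbf{y}_0 + \boldsymbol{\delta}\cdot t + \boldsymbol{\Psi}(1)\cdot(\boldsymbol{\varepsilon}_1 + \boldsymbol{\varepsilon}_2 + \cdots + \boldsymbol{\varepsilon}_t) + \boldsymbol{\eta}_t - \boldsymbol{\eta}_0,
\end{aligned} \qquad [19.1.13]$$

where the last line follows from [18.1.6] for $\boldsymbol{\eta}_t$ a stationary process. Premultiplying
[19.1.13] by $\mathbf{A}'$ results in

$$\mathbf{A}'\mathbf{y}_t = \mathbf{A}'(\mathbf{y}_0 - \boldsymbol{\eta}_0) + \mathbf{A}'\boldsymbol{\delta}\cdot t + \mathbf{A}'\boldsymbol{\Psi}(1)\cdot(\boldsymbol{\varepsilon}_1 + \boldsymbol{\varepsilon}_2 + \cdots + \boldsymbol{\varepsilon}_t) + \mathbf{A}'\boldsymbol{\eta}_t. \qquad [19.1.14]$$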


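If $E(\boldsymbol{\varepsilon}_t\boldsymbol{\varepsilon}_t')$ is nonsingular, then $\mathbf{c}'(\boldsymbol{\varepsilon}_1 + \boldsymbol{\varepsilon}_2 + \cdots + \boldsymbol{\varepsilon}_t)$ is I(1) for every nonzero
$(n \times 1)$ vector $\mathbf{c}$. However, in order for $\mathbf{y}_t$ to be cointegrated with cointegrating
vectors given by the rows of $\mathbf{A}'$, expression [19.1.14] is required to be stationary.
This could occur only if $\mathbf{A}'\boldsymbol{\Psi}(1) = \mathbf{0}$. Thus, [19.1.12] is a necessary condition for
cointegration, as claimed.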

As emphasized by Engle and Yoo (1987) and Ogaki and Park (1992), condition
[19.1.12] is not by itself sufficient to ensure that $\mathbf{A}'\mathbf{y}_t$ is stationary. From
[19.1.14], stationarity further requires that

$$\mathbf{A}'\boldsymbol{\delta} = \mathbf{0}. \qquad [19.1.15]$$

If some of the series exhibit nonzero drift ($\boldsymbol{\delta} \neq \mathbf{0}$), then unless the drift across
series satisfies the restriction of [19.1.15], the linear combination $\mathbf{A}'\mathbf{y}_t$ will grow
deterministically at rate $\mathbf{A}'\boldsymbol{\delta}$. Thus, if the underlying hypothesis suggesting the
possibility of cointegration is that certain linear combinations of $\mathbf{y}_t$ are stable, this
requires that both [19.1.12] and [19.1.15] hold.

Note that [19.1.12] implies that certain linear combinations of the rows of
$\boldsymbol{\Psi}(1)$, such as $\mathbf{a}_1'\boldsymbol{\Psi}(1)$, are zero, meaning that the determinant $|\boldsymbol{\Psi}(z)| = 0$ at
$z = 1$. This in turn means that the matrix operator $\boldsymbol{\Psi}(L)$ is noninvertible. Thus,
a cointegrated system can never be represented by a finite-order vector
autoregression in the differenced data $\Delta\mathbf{y}_t$.

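For the example of [19.1.1] and [19.1.2], we saw in [19.1.7] and [19.1.8] that

$$\boldsymbol{\Psi}(z) = \begin{bmatrix} 1 - z & \gamma z \\ 0 & 1 \end{bmatrix}$$

and

$$\boldsymbol{\Psi}(1) = \begin{bmatrix} 0 & \gamma \\ 0 & 1 \end{bmatrix}.$$

This is a singular matrix with $\mathbf{A}'\boldsymbol{\Psi}(1) = \mathbf{0}'$ for $\mathbf{A}' = [1 \;\; -\gamma]$.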
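As a quick numerical check of this example, the following sketch evaluates the moving average operator at $z = 1$ and confirms both the singularity and the condition [19.1.12]. It takes $\boldsymbol{\Psi}(z)$ to have rows $(1 - z,\ \gamma z)$ and $(0,\ 1)$, as implied by [19.1.1] and [19.1.2]; the value $\gamma = 0.7$ and the function name `psi` are assumptions made purely for illustration.

```python
import numpy as np

gamma = 0.7  # illustrative value only; the text leaves gamma unspecified


def psi(z, gamma):
    """Psi(z) for the two-variable example: [[1 - z, gamma*z], [0, 1]]."""
    return np.array([[1.0 - z, gamma * z],
                     [0.0, 1.0]])


psi_1 = psi(1.0, gamma)               # Psi(1)
a_prime = np.array([[1.0, -gamma]])   # cointegrating vector as a row, A' = [1, -gamma]

det_at_one = np.linalg.det(psi_1)     # should be zero: Psi(1) is singular
product = a_prime @ psi_1             # should be a row of zeros: A' Psi(1) = 0'
```

Any other nonzero multiple of `a_prime` would pass the same check, which is the normalization issue noted earlier.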




Phillips’s Triangular Representation 


Another convenient representation for a cointegrated system was introduced
by Phillips (1991). Suppose that the rows of the $(h \times n)$ matrix $\mathbf{A}'$ form a basis
for the space of cointegrating vectors. If the (1, 1) element of $\mathbf{A}'$ is nonzero, we
can conveniently normalize it to unity. If, instead, the (1, 1) element of $\mathbf{A}'$ is zero,
we can reorder the elements of $\mathbf{y}_t$ so that $y_{1t}$ is included in the first cointegrating
relation. Hence, without loss of generality, we take


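$$\mathbf{A}' = \begin{bmatrix} \mathbf{a}_1' \\ \mathbf{a}_2' \\ \vdots \\ \mathbf{a}_h' \end{bmatrix} = \begin{bmatrix} 1 & a_{12} & a_{13} & \cdots & a_{1n} \\ a_{21} & a_{22} & a_{23} & \cdots & a_{2n} \\ \vdots & \vdots & \vdots & \cdots & \vdots \\ a_{h1} & a_{h2} & a_{h3} & \cdots & a_{hn} \end{bmatrix}.$$

If $a_{21}$ times the first row of $\mathbf{A}'$ is subtracted from the second row, the resulting row
is a new cointegrating vector that is still linearly independent of $\mathbf{a}_1, \mathbf{a}_3, \ldots, \mathbf{a}_h$.³
Similarly we can subtract $a_{31}$ times the first row of $\mathbf{A}'$ from the third row, and $a_{h1}$
times the first row from the $h$th row, to deduce that the rows of the following
matrix also constitute a basis for the space of cointegrating vectors:

$$\mathbf{A}_1' = \begin{bmatrix} 1 & a_{12} & a_{13} & \cdots & a_{1n} \\ 0 & a_{22}^* & a_{23}^* & \cdots & a_{2n}^* \\ \vdots & \vdots & \vdots & \cdots & \vdots \\ 0 & a_{h2}^* & a_{h3}^* & \cdots & a_{hn}^* \end{bmatrix}.$$

Next, suppose that $a_{22}^*$ is nonzero; if $a_{22}^* = 0$, we can again switch $y_{2t}$ with some
variable $y_{3t}, y_{4t}, \ldots, y_{nt}$ that does appear in the second cointegrating relation.
Divide the second row of $\mathbf{A}_1'$ by $a_{22}^*$. The resulting row can then be multiplied by
$a_{12}$ and subtracted from the first row. Similarly, $a_{32}^*$ times the second row of $\mathbf{A}_1'$ can
be subtracted from the third row, and $a_{h2}^*$ times the second row can be subtracted
from the $h$th. Thus, the space of cointegrating vectors can also be represented by

$$\mathbf{A}_2' = \begin{bmatrix} 1 & 0 & a_{13}^{**} & \cdots & a_{1n}^{**} \\ 0 & 1 & a_{23}^{**} & \cdots & a_{2n}^{**} \\ \vdots & \vdots & \vdots & \cdots & \vdots \\ 0 & 0 & a_{h3}^{**} & \cdots & a_{hn}^{**} \end{bmatrix}.$$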


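³Since the first and second moments of the $(h \times 1)$ vector

$$\begin{bmatrix} \mathbf{a}_1' \\ \mathbf{a}_2' \\ \vdots \\ \mathbf{a}_h' \end{bmatrix}\mathbf{y}_t$$

do not depend on time, neither will the first and second moments of

$$\begin{bmatrix} \mathbf{a}_1' \\ \mathbf{a}_2' - a_{21}\mathbf{a}_1' \\ \vdots \\ \mathbf{a}_h' \end{bmatrix}\mathbf{y}_t.$$

Furthermore, the assumption that $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_h$ are linearly independent means that no nontrivial linear
combination of $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_h$ is zero, and so no nontrivial linear combination of $\mathbf{a}_1, \mathbf{a}_2 - a_{21}\mathbf{a}_1, \ldots, \mathbf{a}_h$ can be zero
either. Hence $\mathbf{a}_1, \mathbf{a}_2 - a_{21}\mathbf{a}_1, \ldots, \mathbf{a}_h$ also constitute a basis for the space of cointegrating vectors.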




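Proceeding through each of the $h$ rows of $\mathbf{A}'$ in this fashion, it follows that
given any $(n \times 1)$ vector $\mathbf{y}_t$ that is characterized by exactly $h$ cointegrating relations,
it is possible to order the variables $(y_{1t}, y_{2t}, \ldots, y_{nt})$ in such a way that the
cointegrating relations can be represented by an $(h \times n)$ matrix $\mathbf{A}'$ of the form

$$\mathbf{A}' = \begin{bmatrix} 1 & 0 & \cdots & 0 & -\gamma_{1,h+1} & -\gamma_{1,h+2} & \cdots & -\gamma_{1,n} \\ 0 & 1 & \cdots & 0 & -\gamma_{2,h+1} & -\gamma_{2,h+2} & \cdots & -\gamma_{2,n} \\ \vdots & \vdots & \cdots & \vdots & \vdots & \vdots & \cdots & \vdots \\ 0 & 0 & \cdots & 1 & -\gamma_{h,h+1} & -\gamma_{h,h+2} & \cdots & -\gamma_{h,n} \end{bmatrix} = [\mathbf{I}_h \;\; -\boldsymbol{\Gamma}'], \qquad [19.1.16]$$

where $\boldsymbol{\Gamma}'$ is an $(h \times g)$ matrix of coefficients for $g \equiv n - h$.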
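The row operations described above amount to Gauss-Jordan elimination on the first $h$ columns of $\mathbf{A}'$. The sketch below is one way to carry them out numerically; the function name `triangular_basis` and the example matrix are hypothetical, and the code assumes the variables have already been ordered so that no zero pivot is encountered (otherwise it raises an error rather than reordering).

```python
import numpy as np


def triangular_basis(A_prime, tol=1e-10):
    """Reduce an (h x n) basis A' to the form [I_h, -Gamma'] by row operations.

    Assumes the variables are ordered so that every pivot is nonzero.
    Returns Gamma', an (h x g) array with g = n - h.
    """
    A = np.array(A_prime, dtype=float)
    h, n = A.shape
    for i in range(h):
        if abs(A[i, i]) < tol:
            raise ValueError("zero pivot: reorder the variables first")
        A[i] = A[i] / A[i, i]                  # normalize the pivot to unity
        for k in range(h):
            if k != i:
                A[k] = A[k] - A[k, i] * A[i]   # clear column i in the other rows
    return -A[:, h:]


# Hypothetical basis with h = 2 cointegrating relations among n = 4 variables.
A_prime = np.array([[2.0, 4.0, -2.0, 6.0],
                    [1.0, 3.0, 1.0, -1.0]])
Gamma_prime = triangular_basis(A_prime)
reduced = np.hstack([np.eye(2), -Gamma_prime])   # [I_h, -Gamma']
```

Because only row operations are used, the rows of `reduced` span the same space as the rows of the original `A_prime`.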
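Let $\mathbf{z}_t$ denote the residuals associated with the set of cointegrating relations:

$$\underset{(h \times 1)}{\mathbf{z}_t} \equiv \mathbf{A}'\mathbf{y}_t. \qquad [19.1.17]$$

Since $\mathbf{z}_t$ is stationary, the mean $\boldsymbol{\mu}_1^* \equiv E(\mathbf{z}_t)$ exists, and we can define

$$\mathbf{z}_t^* \equiv \mathbf{z}_t - \boldsymbol{\mu}_1^*. \qquad [19.1.18]$$

Partition $\mathbf{y}_t$ as

$$\underset{(n \times 1)}{\mathbf{y}_t} = \begin{bmatrix} \underset{(h \times 1)}{\mathbf{y}_{1t}} \\ \underset{(g \times 1)}{\mathbf{y}_{2t}} \end{bmatrix}. \qquad [19.1.19]$$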


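Substituting [19.1.16], [19.1.18], and [19.1.19] into [19.1.17] results in

$$\mathbf{z}_t^* + \boldsymbol{\mu}_1^* = [\mathbf{I}_h \;\; -\boldsymbol{\Gamma}'] \begin{bmatrix} \mathbf{y}_{1t} \\ \mathbf{y}_{2t} \end{bmatrix}$$

or

$$\underset{(h \times 1)}{\mathbf{y}_{1t}} = \underset{(h \times g)}{\boldsymbol{\Gamma}'} \cdot \underset{(g \times 1)}{\mathbf{y}_{2t}} + \underset{(h \times 1)}{\boldsymbol{\mu}_1^*} + \underset{(h \times 1)}{\mathbf{z}_t^*}. \qquad [19.1.20]$$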


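A representation for $\mathbf{y}_{2t}$ is given by the last $g$ rows of [19.1.11]:

$$\underset{(g \times 1)}{\Delta\mathbf{y}_{2t}} = \underset{(g \times 1)}{\boldsymbol{\delta}_2} + \underset{(g \times 1)}{\mathbf{u}_{2t}}, \qquad [19.1.21]$$

where $\boldsymbol{\delta}_2$ and $\mathbf{u}_{2t}$ represent the last $g$ elements of the $(n \times 1)$ vectors $\boldsymbol{\delta}$ and $\mathbf{u}_t$,
respectively. Equations [19.1.20] and [19.1.21] constitute Phillips's (1991) triangular
representation of a system with exactly $h$ cointegrating relations. Note that $\mathbf{z}_t^*$ and
$\mathbf{u}_{2t}$ represent zero-mean stationary disturbances in this representation.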

If a vector $\mathbf{y}_t$ is characterized by exactly $h$ cointegrating relations with the
variables ordered so that [19.1.20] and [19.1.21] hold, then the $(g \times 1)$ vector $\mathbf{y}_{2t}$
is I(1) with no cointegrating relations. To verify this last claim, notice that if some
linear combination $\mathbf{c}'\mathbf{y}_{2t}$ were stationary, this would mean that $(\mathbf{0}', \mathbf{c}')\mathbf{y}_t$ would be
stationary or that $(\mathbf{0}', \mathbf{c}')$ would be a cointegrating vector for $\mathbf{y}_t$. But $(\mathbf{0}', \mathbf{c}')$ is
linearly independent of the rows of $\mathbf{A}'$ in [19.1.16], and by the assumption that the
rows of $\mathbf{A}'$ constitute a basis for the space of cointegrating vectors, the linear
combination $(\mathbf{0}', \mathbf{c}')\mathbf{y}_t$ cannot be stationary.

Expressions [19.1.1] and [19.1.2] are a simple example of a cointegrated
system expressed in triangular form. For the purchasing power parity example
[19.1.6], the triangular representation would be

$$\begin{aligned}
p_t &= \gamma_1 s_t + \gamma_2 p_t^* + \mu_1^* + z_t^* \\
\Delta s_t &= \delta_s + u_{s,t} \\
\Delta p_t^* &= \delta_{p^*} + u_{p^*,t},
\end{aligned}$$

where the hypothesized values are $\gamma_1 = \gamma_2 = 1$.


The Stock-Watson Common Trends Representation 


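Another useful representation for any cointegrated system was proposed by
Stock and Watson (1988). Suppose that an $(n \times 1)$ vector $\mathbf{y}_t$ is characterized by
exactly $h$ cointegrating relations with $g = n - h$. We have seen that it is possible
to order the elements of $\mathbf{y}_t$ in such a way that a triangular representation of the
form of [19.1.20] and [19.1.21] exists with $(\mathbf{z}_t^{*\prime}, \mathbf{u}_{2t}')'$ a stationary $(n \times 1)$ vector
with zero mean. Suppose that

$$\begin{bmatrix} \mathbf{z}_t^* \\ \mathbf{u}_{2t} \end{bmatrix} = \begin{bmatrix} \sum_{s=0}^{\infty} \mathbf{H}_s\boldsymbol{\varepsilon}_{t-s} \\ \sum_{s=0}^{\infty} \mathbf{J}_s\boldsymbol{\varepsilon}_{t-s} \end{bmatrix}$$

for $\boldsymbol{\varepsilon}_t$ an $(n \times 1)$ white noise process, with $\{s\cdot\mathbf{H}_s\}_{s=0}^{\infty}$ and $\{s\cdot\mathbf{J}_s\}_{s=0}^{\infty}$ absolutely
summable sequences of $(h \times n)$ and $(g \times n)$ matrices, respectively. Adapting the
result in [18.1.6], equation [19.1.21] implies that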


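$$\begin{aligned}
\mathbf{y}_{2t} &= \mathbf{y}_{2,0} + \boldsymbol{\delta}_2\cdot t + \sum_{s=1}^{t} \mathbf{u}_{2s} \\
&= \mathbf{y}_{2,0} + \boldsymbol{\delta}_2\cdot t + \mathbf{J}(1)\cdot(\boldsymbol{\varepsilon}_1 + \boldsymbol{\varepsilon}_2 + \cdots + \boldsymbol{\varepsilon}_t) + \boldsymbol{\eta}_{2t} - \boldsymbol{\eta}_{2,0},
\end{aligned} \qquad [19.1.22]$$

where $\mathbf{J}(1) \equiv (\mathbf{J}_0 + \mathbf{J}_1 + \mathbf{J}_2 + \cdots)$, $\boldsymbol{\eta}_{2t} \equiv \sum_{s=0}^{\infty} \boldsymbol{\alpha}_{2s}\boldsymbol{\varepsilon}_{t-s}$, and $\boldsymbol{\alpha}_{2s} \equiv -(\mathbf{J}_{s+1} + \mathbf{J}_{s+2} + \mathbf{J}_{s+3} + \cdots)$.
Since the $(n \times 1)$ vector $\boldsymbol{\varepsilon}_t$ is white noise, the $(g \times 1)$ vector
$\mathbf{J}(1)\cdot\boldsymbol{\varepsilon}_t$ is also white noise, implying that each element of the $(g \times 1)$ vector $\boldsymbol{\xi}_{2t}$
defined by

$$\boldsymbol{\xi}_{2t} \equiv \mathbf{J}(1)\cdot(\boldsymbol{\varepsilon}_1 + \boldsymbol{\varepsilon}_2 + \cdots + \boldsymbol{\varepsilon}_t) \qquad [19.1.23]$$

is described by a random walk.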
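Substituting [19.1.23] into [19.1.22] results in

$$\mathbf{y}_{2t} = \tilde{\boldsymbol{\mu}}_2 + \boldsymbol{\delta}_2\cdot t + \boldsymbol{\xi}_{2t} + \boldsymbol{\eta}_{2t} \qquad [19.1.24]$$

for $\tilde{\boldsymbol{\mu}}_2 \equiv (\mathbf{y}_{2,0} - \boldsymbol{\eta}_{2,0})$. Substituting [19.1.24] into [19.1.20] produces

$$\mathbf{y}_{1t} = \tilde{\boldsymbol{\mu}}_1 + \boldsymbol{\Gamma}'(\boldsymbol{\delta}_2\cdot t + \boldsymbol{\xi}_{2t}) + \tilde{\boldsymbol{\eta}}_{1t} \qquad [19.1.25]$$

for $\tilde{\boldsymbol{\mu}}_1 \equiv \boldsymbol{\mu}_1^* + \boldsymbol{\Gamma}'\tilde{\boldsymbol{\mu}}_2$ and $\tilde{\boldsymbol{\eta}}_{1t} \equiv \mathbf{z}_t^* + \boldsymbol{\Gamma}'\boldsymbol{\eta}_{2t}$.
Equations [19.1.24] and [19.1.25] give Stock and Watson's (1988) common
trends representation. These equations show that the vector $\mathbf{y}_t$ can be described as
a stationary component,

$$\begin{bmatrix} \tilde{\boldsymbol{\mu}}_1 + \tilde{\boldsymbol{\eta}}_{1t} \\ \tilde{\boldsymbol{\mu}}_2 + \boldsymbol{\eta}_{2t} \end{bmatrix},$$

plus linear combinations of up to $g$ common deterministic trends, as described by
the $(g \times 1)$ vector $\boldsymbol{\delta}_2\cdot t$, and linear combinations of $g$ common random walk variables
as described by the $(g \times 1)$ vector $\boldsymbol{\xi}_{2t}$.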
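To make the common trends idea concrete, the following sketch builds a three-variable system ($h = 1$, $g = 2$) from two random walks plus stationary noise and then recovers the stationary residual with the cointegrating vector $[\mathbf{I}_h \;\; -\boldsymbol{\Gamma}']$. All numerical values (coefficients, drifts, sample size, seed) are assumptions chosen for illustration, and the stationary components are taken to be simple white noise rather than general linear processes.

```python
import numpy as np

rng = np.random.default_rng(0)
T, g = 500, 2
Gamma_prime = np.array([[1.0, 0.5]])   # (h x g) with h = 1; hypothetical values
delta_2 = np.array([0.1, 0.0])         # drifts of the g trending variables
mu_1 = 2.0                             # mean of the cointegrating residual

t = np.arange(1, T + 1)[:, None]
xi_2 = np.cumsum(rng.standard_normal((T, g)), axis=0)   # g common random walks
eta_2 = 0.5 * rng.standard_normal((T, g))               # stationary part of y_2
z_star = rng.standard_normal((T, 1))                    # zero-mean I(0) error

y_2 = delta_2 * t + xi_2 + eta_2              # trend + random walk + stationary
y_1 = mu_1 + y_2 @ Gamma_prime.T + z_star     # triangular form for y_1
y = np.hstack([y_1, y_2])

A_prime = np.hstack([np.eye(1), -Gamma_prime])   # [I_h, -Gamma']
z = y @ A_prime.T                                # cointegrating residual A'y_t
```

The three series in `y` wander and trend, but `z` equals `mu_1 + z_star` by construction, so its sample variance stays near one no matter how large `T` becomes.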




Implications of Cointegration 
for the Vector Autoregressive Representation 
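Although a VAR in differences is not consistent with a cointegrated system,
a VAR in levels could be. Suppose that the level of $\mathbf{y}_t$ can be represented as a
nonstationary $p$th-order vector autoregression:

$$\mathbf{y}_t = \boldsymbol{\alpha} + \boldsymbol{\Phi}_1\mathbf{y}_{t-1} + \boldsymbol{\Phi}_2\mathbf{y}_{t-2} + \cdots + \boldsymbol{\Phi}_p\mathbf{y}_{t-p} + \boldsymbol{\varepsilon}_t \qquad [19.1.26]$$

or

$$\boldsymbol{\Phi}(L)\mathbf{y}_t = \boldsymbol{\alpha} + \boldsymbol{\varepsilon}_t, \qquad [19.1.27]$$

where

$$\boldsymbol{\Phi}(L) \equiv \mathbf{I}_n - \boldsymbol{\Phi}_1 L - \boldsymbol{\Phi}_2 L^2 - \cdots - \boldsymbol{\Phi}_p L^p. \qquad [19.1.28]$$

Suppose that $\Delta\mathbf{y}_t$ has the Wold representation

$$(1 - L)\mathbf{y}_t = \boldsymbol{\delta} + \boldsymbol{\Psi}(L)\boldsymbol{\varepsilon}_t. \qquad [19.1.29]$$

Premultiplying [19.1.29] by $\boldsymbol{\Phi}(L)$ results in

$$(1 - L)\boldsymbol{\Phi}(L)\mathbf{y}_t = \boldsymbol{\Phi}(1)\boldsymbol{\delta} + \boldsymbol{\Phi}(L)\boldsymbol{\Psi}(L)\boldsymbol{\varepsilon}_t. \qquad [19.1.30]$$

Substituting [19.1.27] into [19.1.30], we have

$$(1 - L)\boldsymbol{\varepsilon}_t = \boldsymbol{\Phi}(1)\boldsymbol{\delta} + \boldsymbol{\Phi}(L)\boldsymbol{\Psi}(L)\boldsymbol{\varepsilon}_t, \qquad [19.1.31]$$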


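since $(1 - L)\boldsymbol{\alpha} = \mathbf{0}$. Now, equation [19.1.31] has to hold for all realizations of $\boldsymbol{\varepsilon}_t$,
which requires that

$$\boldsymbol{\Phi}(1)\boldsymbol{\delta} = \mathbf{0} \qquad [19.1.32]$$

and that $(1 - L)\mathbf{I}_n$ and $\boldsymbol{\Phi}(L)\boldsymbol{\Psi}(L)$ represent the identical polynomials in $L$. This
means that

$$(1 - z)\mathbf{I}_n = \boldsymbol{\Phi}(z)\boldsymbol{\Psi}(z) \qquad [19.1.33]$$

for all values of $z$. In particular, for $z = 1$, equation [19.1.33] implies that

$$\boldsymbol{\Phi}(1)\boldsymbol{\Psi}(1) = \mathbf{0}. \qquad [19.1.34]$$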


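Let $\boldsymbol{\pi}'$ denote any row of $\boldsymbol{\Phi}(1)$. Then [19.1.34] and [19.1.32] state that $\boldsymbol{\pi}'\boldsymbol{\Psi}(1) = \mathbf{0}'$
and $\boldsymbol{\pi}'\boldsymbol{\delta} = 0$. Recalling [19.1.12] and [19.1.15], this means that $\boldsymbol{\pi}$ is a
cointegrating vector. If $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_h$ form a basis for the space of cointegrating
vectors, then it must be possible to express $\boldsymbol{\pi}$ as a linear combination of $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_h$;
that is, there exists an $(h \times 1)$ vector $\mathbf{b}$ such that

$$\boldsymbol{\pi} = [\mathbf{a}_1 \;\; \mathbf{a}_2 \;\; \cdots \;\; \mathbf{a}_h]\mathbf{b}$$

or

$$\boldsymbol{\pi}' = \mathbf{b}'\mathbf{A}'$$

for $\mathbf{A}'$ the $(h \times n)$ matrix whose $i$th row is $\mathbf{a}_i'$. Applying this reasoning to each of
the rows of $\boldsymbol{\Phi}(1)$, it follows that there exists an $(n \times h)$ matrix $\mathbf{B}$ such that

$$\boldsymbol{\Phi}(1) = \mathbf{B}\mathbf{A}'. \qquad [19.1.35]$$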


Note that [19.1.34] implies that $\boldsymbol{\Phi}(1)$ is a singular $(n \times n)$ matrix: linear
combinations of the columns of $\boldsymbol{\Phi}(1)$ of the form $\boldsymbol{\Phi}(1)\mathbf{x}$ are zero for $\mathbf{x}$ any column
of $\boldsymbol{\Psi}(1)$. Thus, the determinant $|\boldsymbol{\Phi}(z)|$ contains a unit root:

$$|\mathbf{I}_n - \boldsymbol{\Phi}_1 z - \boldsymbol{\Phi}_2 z^2 - \cdots - \boldsymbol{\Phi}_p z^p| = 0 \quad \text{at } z = 1.$$

Indeed, in the light of the Stock-Watson common trends representation in [19.1.24]
and [19.1.25], we could say that $\boldsymbol{\Phi}(z)$ contains $g = n - h$ unit roots.
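These restrictions can be verified numerically for the two-variable example of [19.1.1] and [19.1.2]. The sketch below assumes that system is written as a first-order VAR in levels with $\boldsymbol{\Phi}_1$ having rows $(0, \gamma)$ and $(0, 1)$, and that its moving average operator has rows $(1 - z, \gamma z)$ and $(0, 1)$; the value $\gamma = 0.7$, the candidate $\mathbf{B} = (1, 0)'$, and the function names are illustrative assumptions.

```python
import numpy as np

gamma = 0.7  # illustrative value only


def phi(z):
    """Phi(z) = I_2 - Phi_1 z for the example, with Phi_1 = [[0, gamma], [0, 1]]."""
    return np.eye(2) - np.array([[0.0, gamma], [0.0, 1.0]]) * z


def psi(z):
    """Psi(z) for the same example: [[1 - z, gamma*z], [0, 1]]."""
    return np.array([[1.0 - z, gamma * z], [0.0, 1.0]])


B = np.array([[1.0], [0.0]])           # (n x h) loading matrix, assumed form
A_prime = np.array([[1.0, -gamma]])    # (h x n) cointegrating vector

# The polynomial identity (1 - z) I_n = Phi(z) Psi(z) checked at several points.
z_values = [0.0, 0.3, 1.0, -2.0]
identity_holds = all(
    np.allclose(phi(z) @ psi(z), (1.0 - z) * np.eye(2)) for z in z_values
)
phi_1_factorizes = np.allclose(phi(1.0), B @ A_prime)   # Phi(1) = B A'
det_phi_1 = np.linalg.det(phi(1.0))                     # zero: a unit root
```

Here $n = 2$ and $h = 1$, so the single unit root found at $z = 1$ matches the count $g = n - h = 1$.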


Error-Correction Representation 


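A final representation for a cointegrated system is obtained by recalling from
equation [18.2.5] that any VAR in the form of [19.1.26] can equivalently be written
as

$$\mathbf{y}_t = \boldsymbol{\zeta}_1\Delta\mathbf{y}_{t-1} + \boldsymbol{\zeta}_2\Delta\mathbf{y}_{t-2} + \cdots + \boldsymbol{\zeta}_{p-1}\Delta\mathbf{y}_{t-p+1} + \boldsymbol{\alpha} + \boldsymbol{\rho}\mathbf{y}_{t-1} + \boldsymbol{\varepsilon}_t, \qquad [19.1.36]$$

where

$$\boldsymbol{\rho} \equiv \boldsymbol{\Phi}_1 + \boldsymbol{\Phi}_2 + \cdots + \boldsymbol{\Phi}_p \qquad [19.1.37]$$

$$\boldsymbol{\zeta}_s \equiv -[\boldsymbol{\Phi}_{s+1} + \boldsymbol{\Phi}_{s+2} + \cdots + \boldsymbol{\Phi}_p] \quad \text{for } s = 1, 2, \ldots, p - 1. \qquad [19.1.38]$$

Subtracting $\mathbf{y}_{t-1}$ from both sides of [19.1.36] produces

$$\Delta\mathbf{y}_t = \boldsymbol{\zeta}_1\Delta\mathbf{y}_{t-1} + \boldsymbol{\zeta}_2\Delta\mathbf{y}_{t-2} + \cdots + \boldsymbol{\zeta}_{p-1}\Delta\mathbf{y}_{t-p+1} + \boldsymbol{\alpha} + \boldsymbol{\zeta}_0\mathbf{y}_{t-1} + \boldsymbol{\varepsilon}_t, \qquad [19.1.39]$$

where

$$\boldsymbol{\zeta}_0 \equiv \boldsymbol{\rho} - \mathbf{I}_n = -(\mathbf{I}_n - \boldsymbol{\Phi}_1 - \boldsymbol{\Phi}_2 - \cdots - \boldsymbol{\Phi}_p) = -\boldsymbol{\Phi}(1). \qquad [19.1.40]$$

Note that if $\mathbf{y}_t$ has $h$ cointegrating relations, then substitution of [19.1.35] and
[19.1.40] into [19.1.39] results in

$$\Delta\mathbf{y}_t = \boldsymbol{\zeta}_1\Delta\mathbf{y}_{t-1} + \boldsymbol{\zeta}_2\Delta\mathbf{y}_{t-2} + \cdots + \boldsymbol{\zeta}_{p-1}\Delta\mathbf{y}_{t-p+1} + \boldsymbol{\alpha} - \mathbf{B}\mathbf{A}'\mathbf{y}_{t-1} + \boldsymbol{\varepsilon}_t. \qquad [19.1.41]$$

Define $\mathbf{z}_t \equiv \mathbf{A}'\mathbf{y}_t$, noticing that $\mathbf{z}_t$ is a stationary $(h \times 1)$ vector. Then [19.1.41] can
be written

$$\Delta\mathbf{y}_t = \boldsymbol{\zeta}_1\Delta\mathbf{y}_{t-1} + \boldsymbol{\zeta}_2\Delta\mathbf{y}_{t-2} + \cdots + \boldsymbol{\zeta}_{p-1}\Delta\mathbf{y}_{t-p+1} + \boldsymbol{\alpha} - \mathbf{B}\mathbf{z}_{t-1} + \boldsymbol{\varepsilon}_t. \qquad [19.1.42]$$

Expression [19.1.42] is known as the error-correction representation of the
cointegrated system. For example, the first equation takes the form

$$\begin{aligned}
\Delta y_{1t} ={}& \zeta_{11}^{(1)}\Delta y_{1,t-1} + \zeta_{12}^{(1)}\Delta y_{2,t-1} + \cdots + \zeta_{1n}^{(1)}\Delta y_{n,t-1} \\
&+ \zeta_{11}^{(2)}\Delta y_{1,t-2} + \zeta_{12}^{(2)}\Delta y_{2,t-2} + \cdots + \zeta_{1n}^{(2)}\Delta y_{n,t-2} + \cdots \\
&+ \zeta_{11}^{(p-1)}\Delta y_{1,t-p+1} + \zeta_{12}^{(p-1)}\Delta y_{2,t-p+1} + \cdots + \zeta_{1n}^{(p-1)}\Delta y_{n,t-p+1} \\
&+ \alpha_1 - b_{11}z_{1,t-1} - b_{12}z_{2,t-1} - \cdots - b_{1h}z_{h,t-1} + \varepsilon_{1t},
\end{aligned}$$

where $\zeta_{ij}^{(s)}$ indicates the row $i$, column $j$ element of the matrix $\boldsymbol{\zeta}_s$, $b_{ij}$ indicates the
row $i$, column $j$ element of the matrix $\mathbf{B}$, and $z_{it}$ represents the $i$th element of $\mathbf{z}_t$.
Thus, in the error-correction form, changes in each variable are regressed on a
constant, $(p - 1)$ lags of the variable's own changes, $(p - 1)$ lags of changes in
each of the other variables, and the levels of each of the $h$ elements of $\mathbf{z}_{t-1}$.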
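The mapping from the levels VAR coefficients to the error-correction coefficients in [19.1.37] through [19.1.40] is purely mechanical, as the following sketch illustrates. The function name `var_to_vecm` and the randomly drawn coefficient matrices are hypothetical; the check confirms only that the two forms produce the same right-hand side, not that the randomly drawn system is cointegrated.

```python
import numpy as np


def var_to_vecm(Phi):
    """Map levels-VAR matrices [Phi_1, ..., Phi_p] to error-correction form.

    Returns (zeta_0, [zeta_1, ..., zeta_{p-1}]) with
    zeta_0 = Phi_1 + ... + Phi_p - I_n = -Phi(1) and
    zeta_s = -(Phi_{s+1} + ... + Phi_p).
    """
    p = len(Phi)
    n = Phi[0].shape[0]
    zeta_0 = sum(Phi) - np.eye(n)
    zetas = [-sum(Phi[s:]) for s in range(1, p)]   # Phi[s] holds Phi_{s+1}
    return zeta_0, zetas


rng = np.random.default_rng(1)
Phi = [rng.standard_normal((2, 2)) for _ in range(3)]   # hypothetical VAR(3)
zeta_0, zetas = var_to_vecm(Phi)

# Arbitrary lagged values y_{t-1}, y_{t-2}, y_{t-3}.
lags = [rng.standard_normal(2) for _ in range(3)]
levels_rhs = sum(P @ y for P, y in zip(Phi, lags)) - lags[0]   # from the VAR, minus y_{t-1}
diffs = [lags[s] - lags[s + 1] for s in range(2)]              # lagged differences
vecm_rhs = sum(Z @ d for Z, d in zip(zetas, diffs)) + zeta_0 @ lags[0]
```

The constant and the innovation are common to both forms and are omitted from the comparison.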


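For example, recall from [19.1.9] that the system of [19.1.1] and [19.1.2] can
be written in the form

$$\begin{bmatrix} \Delta y_{1t} \\ \Delta y_{2t} \end{bmatrix} = \begin{bmatrix} -1 & \gamma \\ 0 & 0 \end{bmatrix}\begin{bmatrix} y_{1,t-1} \\ y_{2,t-1} \end{bmatrix} + \begin{bmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{bmatrix}.$$

Note that this is a special case of [19.1.39] with $p = 1$,

$$\boldsymbol{\zeta}_0 = \begin{bmatrix} -1 & \gamma \\ 0 & 0 \end{bmatrix},$$

$\varepsilon_{1t} = \gamma u_{2t} + u_{1t}$, $\varepsilon_{2t} = u_{2t}$, and all other parameters in [19.1.39] equal to zero.
The error-correction form is

$$\begin{bmatrix} \Delta y_{1t} \\ \Delta y_{2t} \end{bmatrix} = \begin{bmatrix} -1 \\ 0 \end{bmatrix} z_{t-1} + \begin{bmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{bmatrix},$$

where $z_t \equiv y_{1t} - \gamma y_{2t}$.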


An economic interpretation of an error-correction representation was proposed
by Davidson, Hendry, Srba, and Yeo (1978), who examined a relation
between the log of consumption spending (denoted $c_t$) and the log of income ($y_t$)
of the form

$$(1 - L^4)c_t = \beta_1(1 - L^4)y_t + \beta_2(1 - L^4)y_{t-1} + \beta_3(c_{t-4} - y_{t-4}) + u_t. \qquad [19.1.43]$$

This equation was fitted to quarterly data, so that $(1 - L^4)c_t$ denotes the percentage
change in consumption over its value in the comparable quarter of the preceding
year. The authors argued that seasonal differences $(1 - L^4)$ provided a better
description of the data than would simple quarterly differences $(1 - L)$. Their
claim was that seasonally differenced consumption $(1 - L^4)c_t$ could not be described
using only its own lags or those of seasonally differenced income.
In addition to these factors, [19.1.43] includes the "error-correction" term
$\beta_3(c_{t-4} - y_{t-4})$. One could argue that there is a long-run, historical average ratio
of consumption to income, in which case the difference between the logs of consumption
and income, $c_t - y_t$, would be a stationary random variable, even though
log consumption or log income viewed by itself exhibits a unit root. For $\beta_3 < 0$,
equation [19.1.43] asserts that if consumption had previously been a larger-than-normal
share of income (so that $c_{t-4} - y_{t-4}$ is larger than normal), then that causes
$c_t$ to be lower for any given values of the other explanatory variables. The term
$(c_{t-4} - y_{t-4})$ is viewed as the "error" from the long-run equilibrium relation, and
$\beta_3$ gives the "correction" to $c_t$ caused by this error.


Restrictions on the Constant Term 
in the VAR Representation 


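Notice that all the variables appearing in the error-correction representation
[19.1.42] are stationary. Taking expectations of both sides of that equation results
in

$$(\mathbf{I}_n - \boldsymbol{\zeta}_1 - \boldsymbol{\zeta}_2 - \cdots - \boldsymbol{\zeta}_{p-1})\boldsymbol{\delta} = \boldsymbol{\alpha} - \mathbf{B}\boldsymbol{\mu}_1^*, \qquad [19.1.44]$$

where $\boldsymbol{\delta} \equiv E(\Delta\mathbf{y}_t)$ and $\boldsymbol{\mu}_1^* \equiv E(\mathbf{z}_t)$. Assuming that the roots of

$$|\mathbf{I}_n - \boldsymbol{\zeta}_1 z - \boldsymbol{\zeta}_2 z^2 - \cdots - \boldsymbol{\zeta}_{p-1}z^{p-1}| = 0$$

are all outside the unit circle, the matrix $(\mathbf{I}_n - \boldsymbol{\zeta}_1 - \boldsymbol{\zeta}_2 - \cdots - \boldsymbol{\zeta}_{p-1})$ is nonsingular.
Thus, in order to represent a system in which there is no drift in any of the variables
($\boldsymbol{\delta} = \mathbf{0}$), we would have to impose the restriction

$$\boldsymbol{\alpha} = \mathbf{B}\boldsymbol{\mu}_1^*. \qquad [19.1.45]$$

In the absence of any restriction on $\boldsymbol{\alpha}$, the system of [19.1.42] implies that there
are $g$ separate time trends that account for the trend in $\mathbf{y}_t$.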


Granger Representation Theorem 


For convenience, some of the preceding results are now summarized in the 
form of a proposition. 




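Proposition 19.1: (Granger representation theorem). Consider an $(n \times 1)$ vector
$\mathbf{y}_t$ where $\Delta\mathbf{y}_t$ satisfies [19.1.29] for $\boldsymbol{\varepsilon}_t$ white noise with positive definite
variance-covariance matrix and $\{s\cdot\boldsymbol{\Psi}_s\}_{s=0}^{\infty}$ absolutely summable. Suppose that there are exactly
$h$ cointegrating relations among the elements of $\mathbf{y}_t$. Then there exists an $(h \times n)$
matrix $\mathbf{A}'$ whose rows are linearly independent such that the $(h \times 1)$ vector $\mathbf{z}_t$ defined
by

$$\mathbf{z}_t \equiv \mathbf{A}'\mathbf{y}_t$$

is stationary. The matrix $\mathbf{A}'$ has the property that

$$\mathbf{A}'\boldsymbol{\Psi}(1) = \mathbf{0}.$$

If, moreover, the process can be represented as the $p$th-order VAR in levels as in
equation [19.1.26], then there exists an $(n \times h)$ matrix $\mathbf{B}$ such that

$$\boldsymbol{\Phi}(1) = \mathbf{B}\mathbf{A}',$$

and there further exist $(n \times n)$ matrices $\boldsymbol{\zeta}_1, \boldsymbol{\zeta}_2, \ldots, \boldsymbol{\zeta}_{p-1}$ such that

$$\Delta\mathbf{y}_t = \boldsymbol{\zeta}_1\Delta\mathbf{y}_{t-1} + \boldsymbol{\zeta}_2\Delta\mathbf{y}_{t-2} + \cdots + \boldsymbol{\zeta}_{p-1}\Delta\mathbf{y}_{t-p+1} + \boldsymbol{\alpha} - \mathbf{B}\mathbf{z}_{t-1} + \boldsymbol{\varepsilon}_t.$$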


19.2. Testing the Null Hypothesis 
of No Cointegration 


This section discusses tests for cointegration. The approach will be to test the null
hypothesis that there is no cointegration among the elements of an $(n \times 1)$
vector $\mathbf{y}_t$; rejection of the null is then taken as evidence of cointegration.


Testing for Cointegration When 
the Cointegrating Vector Is Known 


Often when theoretical considerations suggest that certain variables will be
cointegrated, or that $\mathbf{a}'\mathbf{y}_t$ is stationary for some $(n \times 1)$ cointegrating vector $\mathbf{a}$, the
theory is based on a particular known value for $\mathbf{a}$. In the purchasing power parity
example [19.1.6], $\mathbf{a} = (1, -1, -1)'$. The Davidson, Hendry, Srba, and Yeo
hypothesis (1978) that consumption is a stable fraction of income implies a
cointegrating vector of $\mathbf{a} = (1, -1)'$, as did Kremers's assertion (1989) that
government debt is a stable multiple of GNP.

If the interest in cointegration is motivated by the possibility of a particular
known cointegrating vector $\mathbf{a}$, then by far the best method is to use this value
directly to construct a test for cointegration. To implement this approach, we first
test whether each of the elements of $\mathbf{y}_t$ is individually I(1). This can be done using
any of the tests discussed in Chapter 17. Assuming that the null hypothesis of a
unit root in each series individually is accepted, we next construct the scalar $z_t = \mathbf{a}'\mathbf{y}_t$.
Notice that if $\mathbf{a}$ is truly a cointegrating vector, then $\mathbf{a}'\mathbf{y}_t$ will be I(0). If $\mathbf{a}$ is
not a cointegrating vector, then $\mathbf{a}'\mathbf{y}_t$ will be I(1). Thus, a test of the null hypothesis
that $z_t$ is I(1) is equivalent to a test of the null hypothesis that $\mathbf{y}_t$ is not cointegrated.
If the null hypothesis that $z_t$ is I(1) is rejected, we would conclude that $z_t = \mathbf{a}'\mathbf{y}_t$
is stationary, or that $\mathbf{y}_t$ is cointegrated with cointegrating vector $\mathbf{a}$. The null
hypothesis that $z_t$ is I(1) can also be tested using any of the approaches in Chapter 17.
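The second step of this procedure can be sketched in a few lines of code: form $z_t = \mathbf{a}'\mathbf{y}_t$ from the hypothesized vector and compute an augmented Dickey-Fuller $t$ statistic by OLS. The helper name `adf_t_stat`, the simulated data, the lag length of four, and the choice $\mathbf{a} = (1, -1)'$ are all illustrative assumptions; the statistic must be compared with Dickey-Fuller critical values rather than the usual $t$ tables.

```python
import numpy as np


def adf_t_stat(z, n_lags):
    """t statistic for H0: rho = 1 in the OLS regression
    z_t = zeta_1*dz_{t-1} + ... + zeta_k*dz_{t-k} + alpha + rho*z_{t-1} + e_t."""
    z = np.asarray(z, dtype=float)
    dz = np.diff(z)                          # dz[i] = z[i+1] - z[i]
    idx = range(n_lags, len(dz))
    X = np.array([[dz[i - j] for j in range(1, n_lags + 1)] + [1.0, z[i]]
                  for i in idx])
    y = np.array([z[i + 1] for i in idx])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - X.shape[1])   # OLS residual variance
    cov = s2 * np.linalg.inv(X.T @ X)
    return (beta[-1] - 1.0) / np.sqrt(cov[-1, -1])


rng = np.random.default_rng(0)
T = 400
y2 = np.cumsum(rng.standard_normal(T))       # an I(1) series
e = np.zeros(T)
for t in range(1, T):                        # stationary AR(1) deviation
    e[t] = 0.5 * e[t - 1] + rng.standard_normal()
y1 = y2 + e                                  # cointegrated with a = (1, -1)'

a = np.array([1.0, -1.0])                    # hypothesized cointegrating vector
z = np.column_stack([y1, y2]) @ a
t_stat = adf_t_stat(z, n_lags=4)
```

For this simulated pair the statistic falls far below a 5% critical value of about $-2.88$ for the regression with a constant and no trend, so the null hypothesis of no cointegration would be rejected.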

For example, Figure 19.2 plots monthly data from 1973:1 to 1989:10 for the
consumer price indexes for the United States ($p_t$) and Italy ($p_t^*$), along with the
exchange rate ($s_t$), where $s_t$ is in terms of the number of U.S. dollars needed to
purchase an Italian lira. Natural logs of the raw data were taken and multiplied
by 100, and the initial value for 1973:1 was then subtracted, as in

$$p_t = 100\cdot[\log(P_t) - \log(P_{1973:1})].$$

The purpose of subtracting the constant $\log(P_{1973:1})$ from each observation is to
normalize each series to be zero for 1973:1 so that the graph is easier to read.
Multiplying the log by 100 means that $p_t$ is approximately the percentage difference
between $P_t$ and its starting value $P_{1973:1}$. The graph shows that Italy experienced
about twice the average inflation rate of the United States over this period and
that the lira dropped in value relative to the dollar (that is, $s_t$ fell) by roughly this
same proportion.

FIGURE 19.2 One hundred times the log of the price level in the United States
($p_t$), the dollar-lira exchange rate ($s_t$), and the price level in Italy ($p_t^*$), monthly,
1973-89.

Figure 19.3 plots the real exchange rate,

$$z_t \equiv p_t - s_t - p_t^*.$$


FIGURE 19.3 The real dollar-lira exchange rate, monthly, 1973-89.

It appears that the trends are eliminated by this transformation, though deviations
of the real exchange rate from its historical mean can persist for several years.
To test for cointegration, we first verify that $p_t$, $p_t^*$, and $s_t$ are each individually
I(1). Certainly, we anticipate the average inflation rate to be positive ($E(\Delta p_t) > 0$),
so that the natural null hypothesis is that $p_t$ is a unit root process with positive
drift, while the alternative is that $p_t$ is stationary around a deterministic time trend.
With monthly data it is a good idea to include at least twelve lags in the regression.
Thus, the following model was estimated by OLS for the U.S. data for $t$ = 1974:2
through 1989:10 (standard errors in parentheses):
= 0.55: Ap, 17 ore 2+ es 3+ ede de 4 


(0.08) 
— 0.08 A ~ 0.05 Ap, +0174 “ogra 
(0.08) iP 5 P:—6 P+ -7 iP r-8 [19.2.1] 
+ 0.24 Ap, — ant Ap,- wy + 012 Ap, ‘a i ‘008 ap,. eS 
(0.07) 
+ 0.14 + 0.99400 Pia t 0.0029 t. 
(0.09) (0.00307) (0.0018) 


The $t$ statistic for testing the null hypothesis that $\rho$ (the coefficient on $p_{t-1}$) is unity
is

$$t = (0.99400 - 1.0)/(0.00307) = -1.95.$$

Comparing this with the 5% critical value from the case 4 section of Table B.6 for
a sample of size $T = 189$, we see that $-1.95 > -3.44$. Thus, the null hypothesis
of a unit root is accepted. The $F$ test of the joint null hypothesis that $\rho = 1$ and
$\delta = 0$ (for $\rho$ the coefficient on $p_{t-1}$ and $\delta$ the coefficient on the time trend) is 2.41.
Comparing this with the critical value of 6.40 from the case 4 section of Table B.7,
the null hypothesis is again accepted, further confirming the impression that U.S.
prices follow a unit root process with drift.

If $p_t$ in [19.2.1] is replaced by $p_t^*$, the augmented Dickey-Fuller $t$ and $F$ tests
are calculated to be $-0.13$ and 4.25, respectively, so that the null hypothesis that
the Italian price level follows an I(1) process is again accepted. When $p_t$ in [19.2.1]
is replaced by $s_t$, the $t$ and $F$ tests are $-1.58$ and 1.49, so that the exchange rate
likewise admits an ARIMA(12, 1, 0) representation. Thus, each of the three series
individually could reasonably be described as a unit root process with drift.




The next step is to test whether $z_t = p_t - s_t - p_t^*$ is stationary. According
to the theory, there should not be any trend in $z_t$, and none appears evident in
Figure 19.3. Thus, the augmented Dickey-Fuller test without trend might be used.
The following estimates were obtained by OLS:


z, = 0.32 Az,-; — 0.01 Az,_, + 0.01 Az,_3 + 0.02 Az,_, 


(0.07) (0.08) (0.08) (0.08) 
+ 0.08 Az,_,; — 0.00 Az,_, + 0.03 Az,_, + 0.08 Az,_ 
1.08) t-5 (0.08) 1-6 bie ‘-7 (008) 1-8 [19.2.2] 
— 0.05 Az,» + 0.08 Az, wt 0.05 Az,_ uct 0.01 Az, is 
(0.08) (0.08) 
+ 0.00 + rere Bites 
(0.18) (0,01410) 


Here the augmented Dickey-Fuller $t$ test is

$$t = (0.97124 - 1.0)/(0.01410) = -2.04.$$

Comparing this with the 5% critical value for case 2 of Table B.6, we see that
$-2.04 > -2.88$, and so the null hypothesis of a unit root is accepted. The $F$ test
of the joint null hypothesis that $\rho = 1$ and that the constant term is zero is 2.19
< 4.66, which is again accepted. Thus, we could accept the null hypothesis that
the series are not cointegrated.

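Alternatively, the null hypothesis that $z_t$ is nonstationary could be tested using
the Phillips-Perron tests. OLS estimation gives

$$z_t = \underset{(0.178)}{-0.030} + \underset{(0.01275)}{0.98654}\, z_{t-1} + \hat{u}_t$$

with

$$\begin{aligned}
s^2 &= (T - 2)^{-1}\sum_{t=1}^{T} \hat{u}_t^2 = (2.49116)^2 \\
\hat{c}_j &= T^{-1}\sum_{t=j+1}^{T} \hat{u}_t\hat{u}_{t-j} \\
\hat{c}_0 &= 6.144 \\
\hat{\lambda}^2 &= \hat{c}_0 + 2\cdot\sum_{j=1}^{12} [1 - (j/13)]\hat{c}_j = 13.031.
\end{aligned}$$

The Phillips-Perron $Z_\rho$ test is then

$$\begin{aligned}
Z_\rho &= T(\hat{\rho} - 1) - \tfrac{1}{2}\{T\cdot\hat{\sigma}_{\hat{\rho}} \div s\}^2(\hat{\lambda}^2 - \hat{c}_0) \\
&= (201)(0.98654 - 1) - \tfrac{1}{2}\{(201)(0.01275) \div (2.49116)\}^2(13.031 - 6.144) \\
&= -6.35.
\end{aligned}$$

Since $-6.35 > -13.9$, the null hypothesis of no cointegration is again accepted.
Similarly, the Phillips-Perron $Z_t$ test is

$$\begin{aligned}
Z_t &= (\hat{c}_0/\hat{\lambda}^2)^{1/2}(\hat{\rho} - 1)/\hat{\sigma}_{\hat{\rho}} - \tfrac{1}{2}\{T\cdot\hat{\sigma}_{\hat{\rho}} \div s\}(\hat{\lambda}^2 - \hat{c}_0)/\hat{\lambda} \\
&= (6.144/13.031)^{1/2}(0.98654 - 1)/(0.01275) - \tfrac{1}{2}\{(201)(0.01275) \div (2.49116)\}(13.031 - 6.144)/(13.031)^{1/2} \\
&= -1.71,
\end{aligned}$$

which, since $-1.71 > -2.88$, gives the same conclusion as the other tests.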
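The arithmetic behind the two Phillips-Perron statistics can be reproduced directly from the reported estimates. The sketch below uses only the numbers quoted in the text ($T = 201$, the OLS coefficient and its standard error, the residual standard deviation, and the two variance estimates); the variable names are our own.

```python
import math

T = 201
rho_hat, se_rho = 0.98654, 0.01275   # OLS coefficient on z_{t-1} and its standard error
s = 2.49116                          # standard error of the regression
c0, lam2 = 6.144, 13.031             # residual variance and Newey-West long-run variance

ratio = T * se_rho / s               # the term {T * sigma_rho / s}
Z_rho = T * (rho_hat - 1) - 0.5 * ratio**2 * (lam2 - c0)
Z_t = (math.sqrt(c0 / lam2) * (rho_hat - 1) / se_rho
       - 0.5 * ratio * (lam2 - c0) / math.sqrt(lam2))

print(round(Z_rho, 2), round(Z_t, 2))   # -6.35 -1.71
```

Both values lie above their 5% critical values ($-13.9$ and $-2.88$), which is why the null hypothesis of no cointegration is not rejected.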




Clearly, the comments about the observational equivalence of I(0) and I(1)
processes are also applicable to testing for cointegration. There exist both I(0) and
I(1) representations that are perfectly capable of describing the observed data for
$z_t$ plotted in Figure 19.3. Another way of describing the results is to calculate how
long a deviation from purchasing power parity is likely to persist. The regression
of [19.2.2] implies an autoregression in levels of the form

$$z_t = \alpha + \phi_1 z_{t-1} + \phi_2 z_{t-2} + \cdots + \phi_{13}z_{t-13} + \varepsilon_t,$$

for which the impulse-response function,

$$\psi_j = \frac{\partial z_{t+j}}{\partial \varepsilon_t},$$

can be calculated using the methods described in Chapter 1. Figure 19.4 plots the
estimated impulse-response coefficients as a function of $j$. An unanticipated increase
in $z_t$ would cause us to revise upward our forecast of $z_{t+j}$ by 25% even 3 years into
the future ($\psi_{36} = 0.27$). Hence, any forces that restore $z_t$ to its historical value
must operate relatively slowly. The same conclusion might have been gleaned from
Figure 19.3 directly, in that it is clear that deviations of $z_t$ from its historical norm
can persist for a number of years.
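The step from the augmented Dickey-Fuller regression to the levels autoregression and its impulse-response function can be sketched as follows. The scalar mapping used here ($\phi_1 = \rho + \zeta_1$, $\phi_j = \zeta_j - \zeta_{j-1}$, $\phi_p = -\zeta_{p-1}$) is the inverse of [19.1.37] and [19.1.38]; the function names and the coefficient values in the example are hypothetical and are not the estimates reported for the real exchange rate.

```python
import numpy as np


def adf_to_levels(rho, zetas):
    """Map (rho, zeta_1, ..., zeta_{p-1}) from the regression in differences
    to the levels AR coefficients (phi_1, ..., phi_p)."""
    zetas = list(zetas)
    if not zetas:
        return np.array([rho])
    phi = [rho + zetas[0]]
    phi += [zetas[j] - zetas[j - 1] for j in range(1, len(zetas))]
    phi.append(-zetas[-1])
    return np.array(phi)


def impulse_response(phi, horizon):
    """psi_j = d z_{t+j} / d eps_t for an AR(p), by the recursion
    psi_j = phi_1 psi_{j-1} + ... + phi_p psi_{j-p} with psi_0 = 1."""
    psi = np.zeros(horizon + 1)
    psi[0] = 1.0
    for j in range(1, horizon + 1):
        for i in range(1, min(j, len(phi)) + 1):
            psi[j] += phi[i - 1] * psi[j - i]
    return psi


phi = adf_to_levels(rho=0.9, zetas=[0.3])   # hypothetical AR(2): phi = (1.2, -0.3)
psi = impulse_response(phi, horizon=36)
```

With a coefficient on the lagged level close to unity, the entries of `psi` decay slowly, which is the pattern the text describes for the real exchange rate.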


FIGURE 19.4 Impulse-response function for the real dollar-lira exchange rate.
Graph shows $\psi_j = \partial(p_{t+j} - s_{t+j} - p_{t+j}^*)/\partial\varepsilon_t$ as a function of $j$.


Estimating the Cointegrating Vector


If the theoretical model of the system dynamics does not suggest a particular
value for the cointegrating vector $\mathbf{a}$, then one approach to testing for cointegration
is first to estimate $\mathbf{a}$ by OLS. To see why this produces a reasonable initial estimate,
note that if $z_t = \mathbf{a}'\mathbf{y}_t$ is stationary and ergodic for second moments, then

$$T^{-1}\sum_{t=1}^{T} z_t^2 = T^{-1}\sum_{t=1}^{T} (\mathbf{a}'\mathbf{y}_t)^2 \xrightarrow{p} E(z_t^2). \qquad [19.2.3]$$

By contrast, if $\mathbf{a}$ is not a cointegrating vector, then $z_t = \mathbf{a}'\mathbf{y}_t$ is I(1), and so, from
result (h) of Proposition 17.3,

$$T^{-2}\sum_{t=1}^{T} (\mathbf{a}'\mathbf{y}_t)^2 \xrightarrow{L} \lambda^2\cdot\int_0^1 [W(r)]^2\,dr, \qquad [19.2.4]$$

where $W(r)$ is standard Brownian motion and $\lambda$ is a parameter determined by the
autocovariances of $(1 - L)z_t$. Hence, if $\mathbf{a}$ is not a cointegrating vector, the statistic
in [19.2.3] would diverge to $+\infty$.

This suggests that we can obtain a consistent estimate of a cointegrating vector
by choosing $\mathbf{a}$ so as to minimize [19.2.3] subject to some normalization condition
on $\mathbf{a}$. Indeed, such an estimator turns out to be superconsistent, converging at rate
$T$ rather than $T^{1/2}$.

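If it is known for certain that the cointegrating vector has a nonzero coefficient
for the first element of $\mathbf{y}_t$ ($a_1 \neq 0$), then a particularly convenient normalization
is to set $a_1 = 1$ and represent subsequent entries of $\mathbf{a}$ ($a_2, a_3, \ldots, a_n$) as the
negatives of a set of unknown parameters ($\gamma_2, \gamma_3, \ldots, \gamma_n$):

$$\begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_n \end{bmatrix} = \begin{bmatrix} 1 \\ -\gamma_2 \\ -\gamma_3 \\ \vdots \\ -\gamma_n \end{bmatrix}. \qquad [19.2.5]$$

In this case, the objective is to choose ($\gamma_2, \gamma_3, \ldots, \gamma_n$) so as to minimize

$$T^{-1}\sum_{t=1}^{T} (\mathbf{a}'\mathbf{y}_t)^2 = T^{-1}\sum_{t=1}^{T} (y_{1t} - \gamma_2 y_{2t} - \gamma_3 y_{3t} - \cdots - \gamma_n y_{nt})^2. \qquad [19.2.6]$$

This minimization is, of course, achieved by an OLS regression of the first element
of $\mathbf{y}_t$ on all of the others:

$$y_{1t} = \gamma_2 y_{2t} + \gamma_3 y_{3t} + \cdots + \gamma_n y_{nt} + u_t. \qquad [19.2.7]$$

Consistent estimates of $\gamma_2, \gamma_3, \ldots, \gamma_n$ are also obtained when a constant term is
included in [19.2.7], as in

$$y_{1t} = \alpha + \gamma_2 y_{2t} + \gamma_3 y_{3t} + \cdots + \gamma_n y_{nt} + u_t \qquad [19.2.8]$$

or

$$y_{1t} = \alpha + \boldsymbol{\gamma}'\mathbf{y}_{2t} + u_t,$$

where $\boldsymbol{\gamma}' \equiv (\gamma_2, \gamma_3, \ldots, \gamma_n)$ and $\mathbf{y}_{2t} \equiv (y_{2t}, y_{3t}, \ldots, y_{nt})'$.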
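A small simulation illustrates how quickly the OLS estimate from a regression of this kind approaches the true cointegrating coefficient. The sketch assumes a bivariate system with a random-walk regressor and white-noise cointegrating error; the parameter values, sample sizes, seed, and the function name `ols_cointegrating_slope` are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
gamma_true, alpha_true = 2.0, 1.0    # hypothetical cointegrating parameters


def ols_cointegrating_slope(T):
    """Simulate y1 = alpha + gamma*y2 + u with y2 a random walk, then return
    the OLS slope from regressing y1 on a constant and y2."""
    y2 = np.cumsum(rng.standard_normal(T))   # I(1) regressor
    u = rng.standard_normal(T)               # I(0) cointegrating error
    y1 = alpha_true + gamma_true * y2 + u
    X = np.column_stack([np.ones(T), y2])
    coef, *_ = np.linalg.lstsq(X, y1, rcond=None)
    return coef[1]


errors = {T: abs(ols_cointegrating_slope(T) - gamma_true)
          for T in (100, 1000, 10000)}
```

In repeated runs the estimation error tends to shrink roughly in proportion to $1/T$ rather than $1/T^{1/2}$, although any single draw can depart from that pattern.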


These points were first analyzed by Phillips and Durlauf (1986) and Stock 
(1987) and are formally summarized in the following proposition. 


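Proposition 19.2: Let $y_{1t}$ be a scalar and $\mathbf{y}_{2t}$ be a $(g \times 1)$ vector. Let $n \equiv g + 1$,
and suppose that the $(n \times 1)$ vector $(y_{1t}, \mathbf{y}_{2t}')'$ is characterized by exactly one
cointegrating relation ($h = 1$) that has a nonzero coefficient on $y_{1t}$. Let the triangular
representation for the system be

$$y_{1t} = \alpha + \boldsymbol{\gamma}'\mathbf{y}_{2t} + z_t^* \qquad [19.2.9]$$

$$\Delta\mathbf{y}_{2t} = \mathbf{u}_{2t}. \qquad [19.2.10]$$

Suppose that

$$\begin{bmatrix} z_t^* \\ \mathbf{u}_{2t} \end{bmatrix} = \boldsymbol{\Psi}^*(L)\boldsymbol{\varepsilon}_t, \qquad [19.2.11]$$

where $\boldsymbol{\varepsilon}_t$ is an $(n \times 1)$ i.i.d. vector with mean zero, finite fourth moments, and
positive definite variance-covariance matrix $E(\boldsymbol{\varepsilon}_t\boldsymbol{\varepsilon}_t') = \mathbf{P}\mathbf{P}'$. Suppose further that the
sequence of $(n \times n)$ matrices $\{s\cdot\boldsymbol{\Psi}_s^*\}_{s=0}^{\infty}$ is absolutely summable and that the rows
of $\boldsymbol{\Psi}^*(1)$ are linearly independent. Let $\hat{\alpha}_T$ and $\hat{\boldsymbol{\gamma}}_T$ be estimates based on OLS
estimation of [19.2.9],

$$\begin{bmatrix} \hat{\alpha}_T \\ \hat{\boldsymbol{\gamma}}_T \end{bmatrix} = \begin{bmatrix} T & \sum\mathbf{y}_{2t}' \\ \sum\mathbf{y}_{2t} & \sum\mathbf{y}_{2t}\mathbf{y}_{2t}' \end{bmatrix}^{-1}\begin{bmatrix} \sum y_{1t} \\ \sum\mathbf{y}_{2t}y_{1t} \end{bmatrix}, \qquad [19.2.12]$$

where $\sum$ indicates summation over $t$ from 1 to $T$. Partition $\boldsymbol{\Psi}^*(1)\cdot\mathbf{P}$ as

$$\underset{(n \times n)}{\boldsymbol{\Psi}^*(1)\cdot\mathbf{P}} = \begin{bmatrix} \underset{(1 \times n)}{\boldsymbol{\lambda}_1^{*\prime}} \\ \underset{(g \times n)}{\boldsymbol{\Lambda}_2^*} \end{bmatrix}.$$

Then

$$\begin{bmatrix} T^{1/2}(\hat{\alpha}_T - \alpha) \\ T(\hat{\boldsymbol{\gamma}}_T - \boldsymbol{\gamma}) \end{bmatrix} \xrightarrow{L} \begin{bmatrix} 1 & \left\{\int [\mathbf{W}(r)]'\,dr\right\}\cdot\boldsymbol{\Lambda}_2^{*\prime} \\ \boldsymbol{\Lambda}_2^*\cdot\int \mathbf{W}(r)\,dr & \boldsymbol{\Lambda}_2^*\cdot\left\{\int [\mathbf{W}(r)]\cdot[\mathbf{W}(r)]'\,dr\right\}\cdot\boldsymbol{\Lambda}_2^{*\prime} \end{bmatrix}^{-1}\begin{bmatrix} h_1 \\ \mathbf{h}_2 \end{bmatrix}, \qquad [19.2.13]$$

where $\mathbf{W}(r)$ is $n$-dimensional standard Brownian motion, the integral sign denotes
integration over $r$ from 0 to 1, and

$$h_1 \equiv \boldsymbol{\lambda}_1^{*\prime}\cdot\mathbf{W}(1)$$

$$\mathbf{h}_2 \equiv \boldsymbol{\Lambda}_2^*\cdot\left\{\int [\mathbf{W}(r)]\,[d\mathbf{W}(r)]'\right\}\cdot\boldsymbol{\lambda}_1^* + \sum_{v=0}^{\infty} E(\mathbf{u}_{2t}z_{t+v}^*).$$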

Note that the OLS estimate of the cointegrating vector is consistent even
though the error term $u_t$ in [19.2.8] may be serially correlated and correlated with
$\Delta y_{2t}, \Delta y_{3t}, \ldots, \Delta y_{nt}$. The latter correlation would contribute a bias in the limiting
distribution of $T(\hat{\boldsymbol{\gamma}}_T - \boldsymbol{\gamma})$, for then the random variable $\mathbf{h}_2$ would not have mean
zero. However, the bias in $\hat{\boldsymbol{\gamma}}_T$ is $O_p(T^{-1})$.

Since the OLS estimates are consistent, the average squared sample residual
converges to

$$T^{-1}\sum_{t=1}^{T} \hat{u}_t^2 \xrightarrow{p} E(u_t^2),$$

whereas the sample variance of $y_{1t}$,

$$T^{-1}\sum_{t=1}^{T} (y_{1t} - \bar{y}_1)^2,$$

diverges to $+\infty$. Hence, the $R^2$ for the regression of [19.2.8] will converge to unity
as the sample size grows.

Cointegration can be viewed as a structural assumption under which certain
behavioral relations of interest can be estimated from the data by OLS. Consider
the supply-and-demand example in equations [9.1.2] and [9.1.1]:

$$q_t^s = \gamma p_t + \varepsilon_t^s \qquad [19.2.14]$$

$$q_t^d = \beta p_t + \varepsilon_t^d. \qquad [19.2.15]$$

We noted in equation [9.1.6] that if $\varepsilon_t^d$ and $\varepsilon_t^s$ are i.i.d. with $\mathrm{Var}(\varepsilon_t^s)$ finite, then
as the variance of $\varepsilon_t^d$ goes to infinity, OLS estimation of [19.2.14] produces a
consistent estimate of the supply elasticity $\gamma$ despite the potential simultaneous
equations bias. This is because the large shifts in the demand curve effectively trace
out the supply curve in the sample; see Figure 9.3. More generally, if $\varepsilon_t^s$ is I(0)
and $\varepsilon_t^d$ is I(1), then [19.2.14] and [19.2.15] imply that $(q_t, p_t)'$ is cointegrated with
cointegrating vector $(1, -\gamma)'$. In this case the cointegrating vector can be consistently
estimated by OLS for essentially the same reason as in Figure 9.3. The
hypothesis that a certain structural relation involving I(1) variables is characterized
by an I(0) disturbance amounts to a structural assumption that can help identify
the parameters of the structural relation.

Although the estimates based on [19.2.8] are consistent, there often exist 
alternative estimates that are superior. These will be discussed in Section 19.3. 
OLS estimation of [19.2.8] is proposed only as a quick way to obtain an initial 
estimate of the cointegrating vector. 

It was assumed in Proposition 19.2 that $\Delta\mathbf{y}_{2t}$ had mean zero. If, instead,
$E(\Delta\mathbf{y}_{2t}) = \boldsymbol{\delta}_2$, it is straightforward to generalize Proposition 19.2 using a rotation
of variables as in [18.2.43]; for details, see Hansen (1992). As long as there is no
time trend in the true cointegrating relation [19.2.9], the estimate $\hat{\boldsymbol{\gamma}}_T$ based on OLS
estimation of [19.2.8] will be superconsistent regardless of whether the I(1) vector
$\mathbf{y}_{2t}$ includes a deterministic time trend or not.


The Role of Normalization 


The OLS estimate of the cointegrating vector was obtained by normalizing 
the first element of the cointegrating vector a to be unity. The proposal was then 
to regress the first element of y, on the others. For example, with n = 2, we would 


regress y,,0n yz, 
Yu = a+ Wa + uy 


Obviously, we might equally well have normalized a, = 1 and used the same 
argument to suggest a regression of y2, on yj: 


Ya = 8+ Ryu + v, 


The OLS estimate 8 is not simply the inverse of 4, meaning that these two regres- 
sions will give different estimates of the cointegrating vector: 


(2) «1 


Only in the limiting case where the $R^2$ is 1 would the two estimates coincide.
Thus, choosing which variable to call $y_1$ and which to call $y_2$ might end up
making a material difference for the estimate of $\mathbf{a}$ as well as for the evidence one
finds for cointegration among the series. One approach that avoids this
normalization problem is the full-information maximum likelihood estimate proposed by
Johansen (1988, 1991). This will be discussed in detail in Chapter 20.
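
A small numerical illustration of the normalization point may help. The sketch below uses made-up data (the series `y1` and `y2` and all function names are hypothetical, not taken from the text) and checks that the product of the two OLS slopes equals the regression $R^2$, so that one slope is the reciprocal of the other only when $R^2 = 1$.

```python
def demeaned_moments(x, y):
    """Return (sxx, syy, sxy): sums of squares and cross products about the means."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxx, syy, sxy

# Hypothetical data, chosen only for illustration.
y2 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y1 = [1.2, 1.9, 3.4, 3.9, 5.3, 5.8]

s22, s11, s12 = demeaned_moments(y2, y1)
slope_1_on_2 = s12 / s22            # slope when y1 is regressed on a constant and y2
slope_2_on_1 = s12 / s11            # slope when y2 is regressed on a constant and y1
r_squared = s12 ** 2 / (s11 * s22)

print(slope_1_on_2 * slope_2_on_1, r_squared)   # the two numbers agree
print(1.0 / slope_1_on_2, slope_2_on_1)         # these differ unless R^2 = 1
```

Because the product of the slopes is $R^2$, the gap between the two implied cointegrating vectors shrinks as the fit of the regression improves.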


What Is the Regression Estimating When There Is More 
Than One Cointegrating Relation? 


The limiting distribution of the OLS estimate in Proposition 19.2 was derived 
under the assumption that there is just one cointegrating relation (h = 1). In the 
more general case with h > 1, OLS estimation of [19.2.8] should still provide a 
consistent estimate of a cointegrating vector by virtue of the argument given in 
[19.2.3] and [19.2.4]. But which cointegrating vector is it? 

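Consider the general triangular representation for a vector with $h$ cointegrating
relations given in [19.1.20] and [19.1.21]:

$$\mathbf{y}_{1t} = \boldsymbol{\mu}_1^* + \boldsymbol{\Gamma}'\mathbf{y}_{2t} + \mathbf{z}_t^* \qquad \text{[19.2.16]}$$

$$\Delta\mathbf{y}_{2t} = \boldsymbol{\delta}_2 + \mathbf{u}_{2t}, \qquad \text{[19.2.17]}$$

where the $(h \times 1)$ vector $\mathbf{y}_{1t}$ contains the first $h$ elements of $\mathbf{y}_t$ and $\mathbf{y}_{2t}$ contains the
remaining $g$ elements. Since $\mathbf{z}_t^* \equiv (z_{1t}^*, z_{2t}^*, \ldots, z_{ht}^*)'$ is covariance-stationary with
mean zero, we can define $\beta_2, \beta_3, \ldots, \beta_h$ to be the population coefficients
associated with a linear projection of $z_{1t}^*$ on $z_{2t}^*, z_{3t}^*, \ldots, z_{ht}^*$:

$$z_{1t}^* = \beta_2 z_{2t}^* + \beta_3 z_{3t}^* + \cdots + \beta_h z_{ht}^* + u_t, \qquad \text{[19.2.18]}$$

where $u_t$ by construction has mean zero and is uncorrelated with $z_{2t}^*, z_{3t}^*, \ldots, z_{ht}^*$.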

The following proposition, adapted from Wooldridge (1991), shows that the
sample residual $\hat{u}_t$ resulting from OLS estimation of [19.2.8] converges in
probability to the population residual $u_t$ associated with the linear projection in [19.2.18].
In other words, among the set of possible cointegrating relations, OLS estimation
of [19.2.8] selects the relation whose residuals are uncorrelated with any other $I(0)$
linear combinations of $(y_{2t}, y_{3t}, \ldots, y_{nt})$.


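Proposition 19.3: Let $\mathbf{y}_t \equiv (\mathbf{y}_{1t}', \mathbf{y}_{2t}')'$ satisfy [19.2.16] and [19.2.17] with $\mathbf{y}_{1t}$ an
$(h \times 1)$ vector with $h > 1$, and let $\beta_2, \beta_3, \ldots, \beta_h$ denote the linear projection
coefficients in [19.2.18]. Suppose that

$$\begin{bmatrix} \mathbf{z}_t^* \\ \mathbf{u}_{2t} \end{bmatrix} = \sum_{s=0}^{\infty} \boldsymbol{\Psi}_s^* \boldsymbol{\varepsilon}_{t-s},$$

where $\{s \cdot \boldsymbol{\Psi}_s^*\}_{s=0}^{\infty}$ is absolutely summable and $\boldsymbol{\varepsilon}_t$ is an i.i.d. $(n \times 1)$ vector with mean
zero, variance $\mathbf{P}\mathbf{P}'$, and finite fourth moments. Suppose further that the rows of
$\boldsymbol{\Psi}^*(1) \cdot \mathbf{P}$ are linearly independent. Then the coefficient estimates associated with
OLS estimation of

$$y_{1t} = \alpha + \gamma_2 y_{2t} + \gamma_3 y_{3t} + \cdots + \gamma_n y_{nt} + u_t \qquad \text{[19.2.19]}$$

converge in probability to

$$\hat{\alpha}_T \xrightarrow{p} [1 \;\; -\boldsymbol{\beta}']\,\boldsymbol{\mu}_1^*, \qquad \text{[19.2.20]}$$

where

$$\underset{[(h-1) \times 1]}{\boldsymbol{\beta}} \equiv (\beta_2, \beta_3, \ldots, \beta_h)'$$

and

$$\begin{bmatrix} \hat{\gamma}_{2,T} \\ \hat{\gamma}_{3,T} \\ \vdots \\ \hat{\gamma}_{n,T} \end{bmatrix} \xrightarrow{p} \begin{bmatrix} \boldsymbol{\beta} \\ \boldsymbol{\gamma}_2 \end{bmatrix}, \qquad \text{[19.2.21]}$$

where

$$\underset{(g \times 1)}{\boldsymbol{\gamma}_2} \equiv \boldsymbol{\Gamma}\begin{bmatrix} 1 \\ -\boldsymbol{\beta} \end{bmatrix}.$$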


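Proposition 19.3 establishes that the sample residuals associated with OLS
estimation of [19.2.19] converge in probability as follows:

$$\begin{aligned}
y_{1t} - \hat{\alpha}_T &- \hat{\gamma}_{2,T}\, y_{2t} - \hat{\gamma}_{3,T}\, y_{3t} - \cdots - \hat{\gamma}_{n,T}\, y_{nt} \\
&\xrightarrow{p} y_{1t} - [1 \;\; -\boldsymbol{\beta}']\,\boldsymbol{\mu}_1^* - \boldsymbol{\beta}'\begin{bmatrix} y_{2t} \\ y_{3t} \\ \vdots \\ y_{ht} \end{bmatrix} - [1 \;\; -\boldsymbol{\beta}']\,\boldsymbol{\Gamma}'\begin{bmatrix} y_{h+1,t} \\ y_{h+2,t} \\ \vdots \\ y_{nt} \end{bmatrix} \\
&= [1 \;\; -\boldsymbol{\beta}']\,\{\mathbf{y}_{1t} - \boldsymbol{\mu}_1^* - \boldsymbol{\Gamma}'\mathbf{y}_{2t}\} \\
&= [1 \;\; -\boldsymbol{\beta}'] \cdot \mathbf{z}_t^*,
\end{aligned}$$

with the last equality following from [19.2.16]. But from [19.2.18] these are the
same as the population residuals associated with the linear projection of $z_{1t}^*$ on
$z_{2t}^*, z_{3t}^*, \ldots, z_{ht}^*$.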

This is an illustration of a general property observed by Wooldridge (1991). 
Consider a regression model of the form 


$$y_t = \alpha + \mathbf{x}_t'\boldsymbol{\beta} + u_t. \qquad \text{[19.2.22]}$$

If $y_t$ and $\mathbf{x}_t$ are $I(0)$, then $\alpha + \mathbf{x}_t'\boldsymbol{\beta}$ was said to be the linear projection of $y_t$ on $\mathbf{x}_t$
and a constant if the population residual $u_t = y_t - \alpha - \mathbf{x}_t'\boldsymbol{\beta}$ has mean zero and
is uncorrelated with $\mathbf{x}_t$. We saw that in such a case OLS estimation of [19.2.22]
would typically yield consistent estimates of these linear projection coefficients. In
the more general case where $y_t$ can be $I(0)$ or $I(1)$ and elements of $\mathbf{x}_t$ can be $I(0)$
or $I(1)$, the analogous condition is that the residual $u_t = y_t - \alpha - \mathbf{x}_t'\boldsymbol{\beta}$ is a
zero-mean stationary process that is uncorrelated with all $I(0)$ linear combinations of
$\mathbf{x}_t$. Then $\alpha + \mathbf{x}_t'\boldsymbol{\beta}$ can be viewed as the $I(1)$ generalization of a population linear
projection of $y_t$ on a constant and $\mathbf{x}_t$. As long as there is some value for $\boldsymbol{\beta}$ such
that $y_t - \mathbf{x}_t'\boldsymbol{\beta}$ is $I(0)$, such a linear projection $\alpha + \mathbf{x}_t'\boldsymbol{\beta}$ exists, and OLS estimation
of [19.2.22] should give a consistent estimate of this projection.


What Is the Regression Estimating When There Is No 
Cointegrating Relation? 


We have seen that if there is at least one cointegrating relation involving $y_{1t}$,
then OLS estimation of [19.2.19] gives a consistent estimate of a cointegrating
vector. Let us now consider the properties of OLS estimation when there is no
cointegrating relation. Then [19.2.19] is a regression of an $I(1)$ variable on a set
of $(n - 1)$ $I(1)$ variables for which no coefficients produce an $I(0)$ error term. The
regression is therefore subject to the spurious regression problem described in
Section 18.3. The coefficients $\hat{\alpha}_T$ and $\hat{\boldsymbol{\gamma}}_T$ do not provide consistent estimates of any
population parameters, and the OLS sample residuals $\hat{u}_t$ will be nonstationary.
However, this last property can be exploited to test for cointegration. If there is
no cointegration, then a regression of $\hat{u}_t$ on $\hat{u}_{t-1}$ should yield a unit coefficient. If
there is cointegration, then a regression of $\hat{u}_t$ on $\hat{u}_{t-1}$ should yield a coefficient that
is less than 1.
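
The contrast just described can be seen in a small simulation. The following is a minimal sketch under stated assumptions, not a procedure from the text: the sample size, the seed, the parameter values 1 and 2 in the cointegrated case, and all function names are hypothetical choices made for illustration.

```python
import random

def ols_residuals(y, x):
    """Residuals from an OLS regression of y on a constant and the scalar x."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    return [b - intercept - slope * a for a, b in zip(x, y)]

def ar1_coefficient(u):
    """OLS coefficient from regressing u[t] on u[t-1] with no constant."""
    num = sum(u[t - 1] * u[t] for t in range(1, len(u)))
    den = sum(u[t - 1] ** 2 for t in range(1, len(u)))
    return num / den

def random_walk(rng, n):
    """Driftless Gaussian random walk of length n."""
    level, path = 0.0, []
    for _ in range(n):
        level += rng.gauss(0.0, 1.0)
        path.append(level)
    return path

rng = random.Random(0)      # arbitrary seed, for reproducibility only
T = 500
y2 = random_walk(rng, T)

# Not cointegrated: y1 is a random walk independent of y2.
y1_spurious = random_walk(rng, T)
# Cointegrated: y1 equals 1 + 2*y2 plus white noise (hypothetical values).
y1_coint = [1.0 + 2.0 * x + rng.gauss(0.0, 1.0) for x in y2]

rho_spurious = ar1_coefficient(ols_residuals(y1_spurious, y2))
rho_coint = ar1_coefficient(ols_residuals(y1_coint, y2))
print("no cointegration:", round(rho_spurious, 3))
print("cointegration:   ", round(rho_coint, 3))
```

In the spurious case the residual autoregressive coefficient should come out close to unity, while in the cointegrated case (white-noise error) it should be close to zero.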

The proposal is thus to estimate [19.2.19] by OLS and then construct one of
the standard unit root tests on the estimated residuals, such as the augmented
Dickey-Fuller $t$ test or the Phillips $Z_\rho$ or $Z_t$ test. Although these test statistics are
constructed in the same way as when they are applied to an individual series $y_t$,
when the tests are applied to the residuals $\hat{u}_t$ from a spurious regression, the critical
values that are used to interpret the test statistics are different from those employed
in Chapter 17.

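Specifically, let $\mathbf{y}_t$ be an $(n \times 1)$ vector partitioned as

$$\underset{(n \times 1)}{\mathbf{y}_t} = \begin{bmatrix} \underset{(1 \times 1)}{y_{1t}} \\ \underset{(g \times 1)}{\mathbf{y}_{2t}} \end{bmatrix} \qquad \text{[19.2.23]}$$

for $g \equiv (n - 1)$. Consider the regression

$$y_{1t} = \alpha + \boldsymbol{\gamma}'\mathbf{y}_{2t} + u_t. \qquad \text{[19.2.24]}$$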


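Let $\hat{u}_t$ be the sample residual associated with OLS estimation of [19.2.24] in a
sample of size $T$:

$$\hat{u}_t = y_{1t} - \hat{\alpha}_T - \hat{\boldsymbol{\gamma}}_T'\mathbf{y}_{2t} \qquad \text{for } t = 1, 2, \ldots, T, \qquad \text{[19.2.25]}$$

where

$$\begin{bmatrix} \hat{\alpha}_T \\ \hat{\boldsymbol{\gamma}}_T \end{bmatrix} = \begin{bmatrix} T & \Sigma \mathbf{y}_{2t}' \\ \Sigma \mathbf{y}_{2t} & \Sigma \mathbf{y}_{2t}\mathbf{y}_{2t}' \end{bmatrix}^{-1} \begin{bmatrix} \Sigma y_{1t} \\ \Sigma \mathbf{y}_{2t}\, y_{1t} \end{bmatrix}$$

and where $\Sigma$ indicates summation over $t$ from 1 to $T$. The residual $\hat{u}_t$ can then be
regressed on its own lagged value $\hat{u}_{t-1}$ without a constant term:

$$\hat{u}_t = \rho\,\hat{u}_{t-1} + e_t \qquad \text{for } t = 2, 3, \ldots, T, \qquad \text{[19.2.26]}$$

yielding the estimate

$$\hat{\rho}_T = \frac{\sum_{t=2}^{T} \hat{u}_{t-1}\hat{u}_t}{\sum_{t=2}^{T} \hat{u}_{t-1}^2}. \qquad \text{[19.2.27]}$$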
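Let $s_T^2$ be the OLS estimate of the variance of $e_t$ for the regression of [19.2.26]:

$$s_T^2 = (T - 2)^{-1} \sum_{t=2}^{T} (\hat{u}_t - \hat{\rho}_T \hat{u}_{t-1})^2, \qquad \text{[19.2.28]}$$

and let $\hat{\sigma}_{\hat{\rho}_T}$ be the standard error of $\hat{\rho}_T$ as calculated by the usual OLS formula:

$$\hat{\sigma}_{\hat{\rho}_T}^2 = s_T^2 \div \left\{ \sum_{t=2}^{T} \hat{u}_{t-1}^2 \right\}. \qquad \text{[19.2.29]}$$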


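Finally, let $\hat{c}_{j,T}$ be the $j$th sample autocovariance of the estimated residuals
associated with [19.2.26]:

$$\hat{c}_{j,T} = (T - 1)^{-1} \sum_{t=j+2}^{T} \hat{e}_t \hat{e}_{t-j} \qquad \text{for } j = 0, 1, 2, \ldots, T - 2 \qquad \text{[19.2.30]}$$

for $\hat{e}_t \equiv \hat{u}_t - \hat{\rho}_T \hat{u}_{t-1}$; and let the square of $\hat{\lambda}_T$ be given by

$$\hat{\lambda}_T^2 = \hat{c}_{0,T} + 2 \cdot \sum_{j=1}^{q} [1 - j/(q + 1)]\,\hat{c}_{j,T}, \qquad \text{[19.2.31]}$$

where $q$ is the number of autocovariances to be used. Phillips's $Z_\rho$ statistic (1987)
can be calculated just as in [17.6.8]:

$$Z_{\rho,T} \equiv (T - 1)(\hat{\rho}_T - 1) - (1/2) \cdot \{(T - 1)^2 \cdot \hat{\sigma}_{\hat{\rho}_T}^2 \div s_T^2\}\{\hat{\lambda}_T^2 - \hat{c}_{0,T}\}. \qquad \text{[19.2.32]}$$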


However, the asymptotic distribution of this statistic is not the expression in [17.6.8]
but instead is a distribution that will be described in Proposition 19.4.

If the vector $\mathbf{y}_t$ is not cointegrated, then [19.2.24] will be a spurious regression
and $\hat{\rho}_T$ should be near 1. On the other hand, if we find that $\hat{\rho}_T$ is well below 1
(that is, if calculation of [19.2.32] yields a negative number that is sufficiently large
in absolute value), then the null hypothesis that [19.2.24] is a spurious regression
should be rejected, and we would conclude that the variables are cointegrated.

Similarly, Phillips's $Z_t$ statistic associated with the residual autoregression
[19.2.26] would be

$$Z_{t,T} \equiv (\hat{c}_{0,T}/\hat{\lambda}_T^2)^{1/2} \cdot t_T - (1/2) \cdot \{(T - 1) \cdot \hat{\sigma}_{\hat{\rho}_T} \div s_T\}\{\hat{\lambda}_T^2 - \hat{c}_{0,T}\}/\hat{\lambda}_T \qquad \text{[19.2.33]}$$

for $t_T$ the usual OLS $t$ statistic for testing the hypothesis $\rho = 1$:

$$t_T \equiv (\hat{\rho}_T - 1)/\hat{\sigma}_{\hat{\rho}_T}.$$

Alternatively, lagged changes in the residuals could be added to the regression of
[19.2.26] as in the augmented Dickey-Fuller test with no constant term:

$$\hat{u}_t = \zeta_1 \Delta\hat{u}_{t-1} + \zeta_2 \Delta\hat{u}_{t-2} + \cdots + \zeta_{p-1} \Delta\hat{u}_{t-p+1} + \rho\,\hat{u}_{t-1} + e_t. \qquad \text{[19.2.34]}$$

Again, this is estimated by OLS for $t = p + 1, p + 2, \ldots, T$, and the OLS $t$
test of $\rho = 1$ is calculated using the standard OLS formula [8.1.26]. If this $t$ statistic
or the $Z_t$ statistic in [19.2.33] is negative and sufficiently large in absolute value,
this again casts doubt on the null hypothesis of no cointegration.
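
The calculations in [19.2.27] through [19.2.33] can be collected in one routine. The sketch below is one possible transcription, assuming the residuals are supplied as a plain Python list; the function name `phillips_statistics`, the dictionary keys, and the demonstration residuals `u_demo` are hypothetical and chosen only for illustration.

```python
def phillips_statistics(u, q):
    """Z_rho [19.2.32] and Z_t [19.2.33] from residuals u, using q autocovariances."""
    T = len(u)
    den = sum(u[t - 1] ** 2 for t in range(1, T))
    rho = sum(u[t - 1] * u[t] for t in range(1, T)) / den      # [19.2.27]
    e = [u[t] - rho * u[t - 1] for t in range(1, T)]           # e-hat for dates 2..T
    s2 = sum(v * v for v in e) / (T - 2)                       # [19.2.28]
    se_rho = (s2 / den) ** 0.5                                 # [19.2.29]

    def c(j):                                                  # [19.2.30]
        return sum(e[i] * e[i - j] for i in range(j, len(e))) / (T - 1)

    lam2 = c(0) + 2.0 * sum((1.0 - j / (q + 1.0)) * c(j)
                            for j in range(1, q + 1))          # [19.2.31]
    scale = (T - 1) * se_rho / s2 ** 0.5
    t_stat = (rho - 1.0) / se_rho
    z_rho = (T - 1) * (rho - 1.0) - 0.5 * scale ** 2 * (lam2 - c(0))
    z_t = (c(0) / lam2) ** 0.5 * t_stat - 0.5 * scale * (lam2 - c(0)) / lam2 ** 0.5
    return {"rho": rho, "se_rho": se_rho, "t": t_stat, "z_rho": z_rho, "z_t": z_t}

# Hypothetical residuals, for illustration only.
u_demo = [0.5, 0.9, 0.4, -0.2, -0.6, -0.1, 0.3, 0.8, 0.2, -0.4]
stats_q0 = phillips_statistics(u_demo, q=0)
stats_q2 = phillips_statistics(u_demo, q=2)
print(stats_q0["z_rho"], stats_q0["z_t"])
print(stats_q2["z_rho"], stats_q2["z_t"])
```

With `q=0` the correction terms vanish, so the routine returns $(T - 1)(\hat{\rho}_T - 1)$ and the ordinary OLS $t$ statistic, which gives a simple check on the transcription.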

The following proposition, adapted from Phillips and Ouliaris (1990), provides 
a formal statement of the asymptotic distributions of these three test statistics. 


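Proposition 19.4: Consider an $(n \times 1)$ vector $\mathbf{y}_t$ such that

$$\Delta\mathbf{y}_t = \sum_{s=0}^{\infty} \boldsymbol{\Psi}_s \boldsymbol{\varepsilon}_{t-s}$$

for $\boldsymbol{\varepsilon}_t$ an i.i.d. sequence with mean zero, variance $E(\boldsymbol{\varepsilon}_t \boldsymbol{\varepsilon}_t') = \mathbf{P}\mathbf{P}'$, and finite fourth
moments, and where $\{s \cdot \boldsymbol{\Psi}_s\}_{s=0}^{\infty}$ is absolutely summable. Let $g \equiv n - 1$ and $\boldsymbol{\Lambda} \equiv
\boldsymbol{\Psi}(1) \cdot \mathbf{P}$. Suppose that the $(n \times n)$ matrix $\boldsymbol{\Lambda}\boldsymbol{\Lambda}'$ is nonsingular, and let $\mathbf{L}$ denote the
Cholesky factor of $(\boldsymbol{\Lambda}\boldsymbol{\Lambda}')^{-1}$:

$$(\boldsymbol{\Lambda}\boldsymbol{\Lambda}')^{-1} = \mathbf{L}\mathbf{L}'. \qquad \text{[19.2.35]}$$

Then the following hold:

(a) The statistic $\hat{\rho}_T$ defined in [19.2.27] satisfies

$$\begin{aligned}
(T - 1)(\hat{\rho}_T - 1) \xrightarrow{L} \Bigg\{ &\frac{1}{2}\left\{ [1 \;\; -\mathbf{h}_2'] \cdot [\mathbf{W}^*(1)] \cdot [\mathbf{W}^*(1)]' \begin{bmatrix} 1 \\ -\mathbf{h}_2 \end{bmatrix} \right\} - h_1 [\mathbf{W}^*(1)]' \begin{bmatrix} 1 \\ -\mathbf{h}_2 \end{bmatrix} \\
&- \frac{1}{2}\,[1 \;\; -\mathbf{h}_2']\,\mathbf{L}'\{E(\Delta\mathbf{y}_t)(\Delta\mathbf{y}_t')\}\mathbf{L} \begin{bmatrix} 1 \\ -\mathbf{h}_2 \end{bmatrix} \Bigg\} \div H_n. \qquad \text{[19.2.36]}
\end{aligned}$$

Here, $\mathbf{W}^*(r)$ denotes $n$-dimensional standard Brownian motion partitioned
as

$$\underset{(n \times 1)}{\mathbf{W}^*(r)} = \begin{bmatrix} \underset{(1 \times 1)}{W_1^*(r)} \\ \underset{(g \times 1)}{\mathbf{W}_2^*(r)} \end{bmatrix};$$

$h_1$ is a scalar and $\mathbf{h}_2$ a $(g \times 1)$ vector given by

$$\begin{bmatrix} h_1 \\ \mathbf{h}_2 \end{bmatrix} \equiv \begin{bmatrix} 1 & \int [\mathbf{W}_2^*(r)]'\,dr \\ \int \mathbf{W}_2^*(r)\,dr & \int [\mathbf{W}_2^*(r)] \cdot [\mathbf{W}_2^*(r)]'\,dr \end{bmatrix}^{-1} \begin{bmatrix} \int W_1^*(r)\,dr \\ \int \mathbf{W}_2^*(r) \cdot W_1^*(r)\,dr \end{bmatrix},$$

where the integral sign indicates integration over $r$ from 0 to 1; and

$$H_n \equiv \int [W_1^*(r)]^2\,dr - \left[ \int W_1^*(r)\,dr \quad \int [W_1^*(r)] \cdot [\mathbf{W}_2^*(r)]'\,dr \right] \begin{bmatrix} h_1 \\ \mathbf{h}_2 \end{bmatrix}.$$

(b) If $q \to \infty$ as $T \to \infty$ but $q/T \to 0$, then the statistic $Z_{\rho,T}$ in [19.2.32] satisfies

$$Z_{\rho,T} \xrightarrow{L} Z_n, \qquad \text{[19.2.37]}$$

where

$$Z_n \equiv \left\{ \frac{1}{2}\left\{ [1 \;\; -\mathbf{h}_2'] \cdot [\mathbf{W}^*(1)] \cdot [\mathbf{W}^*(1)]' \begin{bmatrix} 1 \\ -\mathbf{h}_2 \end{bmatrix} \right\} - h_1 [\mathbf{W}^*(1)]' \begin{bmatrix} 1 \\ -\mathbf{h}_2 \end{bmatrix} - \frac{1}{2}(1 + \mathbf{h}_2'\mathbf{h}_2) \right\} \div H_n. \qquad \text{[19.2.38]}$$

(c) If $q \to \infty$ as $T \to \infty$ but $q/T \to 0$, then the statistic $Z_{t,T}$ in [19.2.33] satisfies

$$Z_{t,T} \xrightarrow{L} Z_n \cdot \sqrt{H_n} \div (1 + \mathbf{h}_2'\mathbf{h}_2)^{1/2}. \qquad \text{[19.2.39]}$$

(d) If, in addition to the preceding assumptions, $\Delta\mathbf{y}_t$ follows a zero-mean stationary
vector ARMA process and if $p \to \infty$ as $T \to \infty$ but $p/T^{1/3} \to 0$, then the augmented
Dickey-Fuller $t$ test associated with [19.2.34] has the same limiting distribution
as the test statistic $Z_{t,T}$ described in [19.2.39].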


Result (a) implies that $\hat{\rho}_T \xrightarrow{p} 1$. Hence, when the estimated "cointegrating"
regression [19.2.24] is spurious, the estimated residuals from this regression behave
like a unit root process in the sense that if $\hat{u}_t$ is regressed on $\hat{u}_{t-1}$, the estimated
coefficient should tend to unity as the sample size grows. No linear combination
of $\mathbf{y}_t$ is stationary, and so the residuals from the spurious regression cannot be
stationary.

Note that since $W_1^*(r)$ and $\mathbf{W}_2^*(r)$ are standard Brownian motion, the
distributions of the terms $h_1$, $\mathbf{h}_2$, $H_n$, and $Z_n$ in Proposition 19.4 depend only on the
number of stochastic explanatory variables included in the cointegrating regression
$(n - 1)$ and on whether a constant term appears in that regression but are not
affected by the variances, correlations, and dynamics of $\Delta\mathbf{y}_t$.

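In the special case when $\Delta\mathbf{y}_t$ is i.i.d., then $\boldsymbol{\Psi}(L) = \mathbf{I}_n$ and the matrix $\boldsymbol{\Lambda}\boldsymbol{\Lambda}' =
E[(\Delta\mathbf{y}_t)(\Delta\mathbf{y}_t')]$. Since $\mathbf{L}\mathbf{L}' = (\boldsymbol{\Lambda}\boldsymbol{\Lambda}')^{-1}$, it follows that $(\boldsymbol{\Lambda}\boldsymbol{\Lambda}') = (\mathbf{L}')^{-1}(\mathbf{L})^{-1}$. Hence,
for this special case,

$$\mathbf{L}'\{E[(\Delta\mathbf{y}_t)(\Delta\mathbf{y}_t')]\}\mathbf{L} = \mathbf{L}'(\boldsymbol{\Lambda}\boldsymbol{\Lambda}')\mathbf{L} = \mathbf{L}'(\mathbf{L}')^{-1}(\mathbf{L})^{-1}\mathbf{L} = \mathbf{I}_n. \qquad \text{[19.2.40]}$$

If [19.2.40] is substituted into [19.2.36], the result is that when $\Delta\mathbf{y}_t$ is i.i.d.,

$$(T - 1)(\hat{\rho}_T - 1) \xrightarrow{L} Z_n$$

for $Z_n$ defined in [19.2.38].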
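
The algebra behind [19.2.40] can be checked numerically for a small case. The sketch below assumes a hypothetical $(2 \times 2)$ value for $E[(\Delta\mathbf{y}_t)(\Delta\mathbf{y}_t')]$ (the matrix `S` is invented for illustration), forms the lower-triangular Cholesky factor of its inverse by hand, and confirms that the sandwich product is the identity matrix.

```python
# Hypothetical 2x2 value for Lambda Lambda' = E[(dy_t)(dy_t)'].
S = [[2.0, 0.6],
     [0.6, 1.0]]

# Inverse of the 2x2 matrix S.
det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
S_inv = [[S[1][1] / det, -S[0][1] / det],
         [-S[1][0] / det, S[0][0] / det]]

# Lower-triangular Cholesky factor L satisfying L L' = S_inv, as in [19.2.35].
l11 = S_inv[0][0] ** 0.5
l21 = S_inv[1][0] / l11
l22 = (S_inv[1][1] - l21 ** 2) ** 0.5
L = [[l11, 0.0],
     [l21, l22]]

def matmul(A, B):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(A):
    """Transpose of a 2x2 matrix stored as nested lists."""
    return [[A[j][i] for j in range(2)] for i in range(2)]

product = matmul(matmul(transpose(L), S), L)   # L' S L, as in [19.2.40]
print(product)                                 # numerically the identity matrix
```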

In the more general case when $\Delta\mathbf{y}_t$ is serially correlated, the limiting
distribution of $T(\hat{\rho}_T - 1)$ depends on the nature of this correlation as captured by the
elements of $\mathbf{L}$. However, the corrections for autocorrelation implicit in Phillips's
$Z_\rho$ and $Z_t$ statistics or the augmented Dickey-Fuller $t$ test turn out to generate
variables whose distributions do not depend on any nuisance parameters.

Although the distributions of $Z_\rho$, $Z_t$, and the augmented Dickey-Fuller $t$ test
do not depend on nuisance parameters, the distributions when these statistics are
calculated from the residuals $\hat{u}_t$ are not the same as the distributions these statistics
would have if calculated from the raw data $y_t$. Moreover, different values for $n - 1$
(the number of stochastic explanatory variables in the cointegrating regression of
[19.2.24]) imply different characterizations of the limiting statistics $h_1$, $\mathbf{h}_2$, $H_n$, and
$Z_n$, meaning that a different critical value must be used to interpret $Z_\rho$ for each
value of $n - 1$. Similarly, the asymptotic distributions of $\mathbf{h}_2$, $H_n$, and $Z_n$ are different
depending on whether a constant term is included in the cointegrating regression
[19.2.24].

The section labeled Case 1 in Table B.8 refers to the case when the
cointegrating regression is estimated without a constant term:

$$y_{1t} = \gamma_2 y_{2t} + \gamma_3 y_{3t} + \cdots + \gamma_n y_{nt} + u_t. \qquad \text{[19.2.41]}$$

The table reports Monte Carlo estimates of the critical values for the test statistic
$Z_\rho$ described in [19.2.32], for $\hat{u}_t$ the date $t$ residual from OLS estimation of [19.2.41].
The values were calculated by generating a sample of size $T = 500$ for $y_{1t}, y_{2t},
\ldots, y_{nt}$ independent Gaussian random walks, estimating [19.2.41] and [19.2.26]
by OLS, and tabulating the distribution of $(T - 1)(\hat{\rho}_T - 1)$. For example, the
table indicates that if we were to regress a random walk $y_{1t}$ on three other random
walks ($y_{2t}$, $y_{3t}$, and $y_{4t}$), then in 95% of the samples, $(T - 1)(\hat{\rho}_T - 1)$ would be
greater than $-27.9$, that is, $\hat{\rho}_T$ should exceed 0.94 in a sample of size $T = 500$. If
the estimate $\hat{\rho}_T$ is below 0.94, then this might be taken as evidence that the series
are cointegrated.
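
The Monte Carlo design just described can be sketched in a few lines. The version below is a deliberately scaled-down illustration, not a reproduction of the tabulated values: it uses a single explanatory random walk ($n - 1 = 1$), a smaller sample size and far fewer replications than a serious tabulation would, and the seed and the name `case1_statistic` are arbitrary assumptions.

```python
import random

def case1_statistic(rng, T):
    """(T-1)(rho_hat - 1) from regressing one random walk on another, no constant."""
    y1, y2, a, b = [], [], 0.0, 0.0
    for _ in range(T):
        a += rng.gauss(0.0, 1.0)
        b += rng.gauss(0.0, 1.0)
        y1.append(a)
        y2.append(b)
    # OLS estimate of [19.2.41] with a single regressor.
    gamma = sum(p * r for p, r in zip(y1, y2)) / sum(r * r for r in y2)
    u = [p - gamma * r for p, r in zip(y1, y2)]
    # Residual autoregression [19.2.26] and its coefficient [19.2.27].
    rho = (sum(u[t - 1] * u[t] for t in range(1, T))
           / sum(u[t - 1] ** 2 for t in range(1, T)))
    return (T - 1) * (rho - 1.0)

rng = random.Random(12345)          # arbitrary seed
T, replications = 200, 400          # much smaller than a careful study would use
draws = sorted(case1_statistic(rng, T) for _ in range(replications))
five_percent = draws[int(0.05 * replications)]
print("simulated 5% critical value:", round(five_percent, 1))
```

The printed quantile is only a rough estimate; it can be compared with the $(n - 1) = 1$ entry of the Case 1 section of Table B.8, and raising `T` and `replications` should bring it closer.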

The section labeled Case 2 in Table B.8 gives critical values for $Z_{\rho,T}$ when a
constant term is included in the cointegrating regression:

$$y_{1t} = \alpha + \gamma_2 y_{2t} + \gamma_3 y_{3t} + \cdots + \gamma_n y_{nt} + u_t. \qquad \text{[19.2.42]}$$

For this case, [19.2.26] is estimated with $\hat{u}_t$ now interpreted as the residual from
OLS estimation of [19.2.42]. Note that the different cases (1 and 2) refer to whether
a constant term is included in the cointegrating regression [19.2.42] and not to
whether a constant term is included in the residual regression [19.2.26]. In each
case, the autoregression for the residuals is estimated in the form of [19.2.26] with
no constant term.

Critical values for the $Z_t$ statistic or the augmented Dickey-Fuller $t$ statistic
are reported in Table B.9. Again, if no constant term is included in the cointegrating
regression as in [19.2.41], the case 1 entries are appropriate, whereas if a constant
term is included in the cointegrating regression as in [19.2.42], the case 2 entries
should be used. If the value for the $Z_t$ or augmented Dickey-Fuller $t$ statistic is
negative and large in absolute value, this is evidence against the null hypothesis
that $\mathbf{y}_t$ is not cointegrated.

When the corrections for serial correlation implicit in the $Z_\rho$, $Z_t$, or augmented
Dickey-Fuller test are used, the justification for using the critical values in Table
B.8 or B.9 is asymptotic, and accordingly these tables describe only the
large-sample distribution. Small-sample critical values tabulated by Engle and Yoo (1987)
and Haug (1992) can differ somewhat from the large-sample critical values.


Testing for Cointegration Among Trending Series


It was assumed in Proposition 19.4 that $E(\Delta\mathbf{y}_t) = \mathbf{0}$, in which case none of
the series would exhibit nonzero drift. Bruce Hansen (1992) described how the
results change if instead $E(\Delta\mathbf{y}_t)$ contains one or more nonzero elements.

Consider first the case $n = 2$, a regression of one scalar on another:

$$y_{1t} = \alpha + \gamma y_{2t} + u_t. \qquad \text{[19.2.43]}$$

Suppose that

$$\Delta y_{2t} = \delta_2 + u_{2t}$$

with $\delta_2 \neq 0$. Then

$$y_{2t} = y_{2,0} + \delta_2 \cdot t + \sum_{s=1}^{t} u_{2s},$$

which is asymptotically dominated by the deterministic time trend $\delta_2 \cdot t$. Thus,
estimates $\hat{\alpha}_T$ and $\hat{\gamma}_T$ based on OLS estimation of [19.2.43] have the same asymptotic
distribution as the coefficients in a regression of an $I(1)$ series on a constant and
a time trend. If

$$\Delta y_{1t} = \delta_1 + u_{1t}$$

(where $\delta_1$ may be zero), then the OLS estimate $\hat{\gamma}_T$ based on [19.2.43] gives a
consistent estimate of $(\delta_1/\delta_2)$, and the first difference of the residuals from that
regression converges to $u_{1t} - (\delta_1/\delta_2)u_{2t}$; see Exercise 19.1.
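
The claim that the OLS slope estimates the ratio of the two drifts can be illustrated by simulation. The sketch below assumes two independent Gaussian random walks with hypothetical drifts of 0.5 and 1.0; the sample size and seed are arbitrary.

```python
import random

rng = random.Random(7)              # arbitrary seed
T = 2000
delta1, delta2 = 0.5, 1.0           # hypothetical drifts; delta2 must be nonzero

# Two independent random walks with drift, so the pair is not cointegrated.
y1, y2, a, b = [], [], 0.0, 0.0
for _ in range(T):
    a += delta1 + rng.gauss(0.0, 1.0)
    b += delta2 + rng.gauss(0.0, 1.0)
    y1.append(a)
    y2.append(b)

# OLS slope from a regression of y1 on a constant and y2.
m1, m2 = sum(y1) / T, sum(y2) / T
gamma_hat = (sum((q - m2) * (p - m1) for p, q in zip(y1, y2))
             / sum((q - m2) ** 2 for q in y2))
print(round(gamma_hat, 3), "versus delta1/delta2 =", delta1 / delta2)
```

Even though the two series share no stochastic trend, the slope settles near the drift ratio because the deterministic trends dominate the sample moments.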

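If, in fact, [19.2.43] were a simple time trend regression of the form

$$y_{1t} = \alpha + \gamma t + u_t,$$

then an augmented Dickey-Fuller test on the residuals,

$$\hat{u}_t = \zeta_1 \Delta\hat{u}_{t-1} + \zeta_2 \Delta\hat{u}_{t-2} + \cdots + \zeta_{p-1} \Delta\hat{u}_{t-p+1} + \rho\,\hat{u}_{t-1} + e_t, \qquad \text{[19.2.44]}$$

would be asymptotically equivalent to an augmented Dickey-Fuller test on the
original series $y_{1t}$ that included a constant term and a time trend:

$$y_{1t} = \zeta_1 \Delta y_{1,t-1} + \zeta_2 \Delta y_{1,t-2} + \cdots + \zeta_{p-1} \Delta y_{1,t-p+1} + \alpha + \rho\, y_{1,t-1} + \delta t + u_t. \qquad \text{[19.2.45]}$$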


Since the residuals from OLS estimation of [19.2.43] behave like the residuals from
a regression of $[y_{1t} - (\delta_1/\delta_2)y_{2t}]$ on a time trend, Hansen (1992) showed that when
$y_{2t}$ has a nonzero trend, the $t$ test of $\rho = 1$ in [19.2.44] for $\hat{u}_t$ the residual from
OLS estimation of [19.2.43] has the same asymptotic distribution as the usual
augmented Dickey-Fuller $t$ test for a regression of the form of [19.2.45] with $y_{1t}$
replaced by $[y_{1t} - (\delta_1/\delta_2)y_{2t}]$. Thus, if the cointegrating regression involves a single
variable $y_{2t}$ with nonzero drift, we estimate the regression [19.2.43] and calculate
the $Z_t$ or augmented Dickey-Fuller $t$ statistic in exactly the same manner that was
specified in equation [19.2.33] or [19.2.34]. However, rather than compare these
statistics with the $(n - 1) = 1$ entry for case 2 from Table B.9, we instead compare
these statistics with the case 4 section of Table B.6.

For convenience, the values for a sample of size $T = 500$ for the univariate
case 4 section of Table B.6 are reproduced in the $(n - 1) = 1$ row of the section
labeled Case 3 in Table B.9. This is described as case 3 in the multivariate
tabulations for the following reason. In the univariate analysis, "case 3" referred to a
regression in which the single variable $y_t$ had a nonzero trend but no trend term
was included in the regression. The multivariate generalization obtains when the
explanatory variable $y_{2t}$ has a nonzero trend but no trend is included in the
regression [19.2.43]. The asymptotic distribution that describes the residuals from that
regression is the same as that for a univariate regression in which a trend is included.

Similarly, if $y_{2t}$ has a nonzero trend, we can estimate [19.2.43] by OLS and
construct Phillips's $Z_\rho$ statistic exactly as in equation [19.2.32] and compare this
with the values tabulated in the case 4 portion of Table B.5. These numbers are
reproduced in row $(n - 1) = 1$ of the case 3 section of Table B.8.

More generally, consider a regression involving $n - 1$ stochastic explanatory
variables of the form of [19.2.42]. Let $\delta_i$ denote the trend in the $i$th variable:

$$E(\Delta y_{it}) = \delta_i.$$

Suppose that at least one of the explanatory variables has a nonzero trend
component; for illustration, call this the $n$th variable:

$$\delta_n \neq 0.$$

Whether or not other explanatory variables or the dependent variable also have
nonzero trends turns out not to matter for the asymptotic distribution; that is, the
values of $\delta_1, \delta_2, \ldots, \delta_{n-1}$ are irrelevant given that $\delta_n \neq 0$.

Note that the fitted values of [19.2.42] are identical to the fitted values from
OLS estimation of

$$y_{1t}^* = \alpha^* + \gamma_2^* y_{2t}^* + \gamma_3^* y_{3t}^* + \cdots + \gamma_{n-1}^* y_{n-1,t}^* + \gamma_n^* y_{nt} + u_t, \qquad \text{[19.2.46]}$$

where

$$y_{it}^* \equiv y_{it} - (\delta_i/\delta_n)\,y_{nt} \qquad \text{for } i = 1, 2, \ldots, n - 1.$$

As in the analysis of [18.2.44], moments involving $y_{nt}$ are dominated by the time
trend $\delta_n t$, while the $y_{it}^*$ are driftless $I(1)$ variables for $i = 1, 2, \ldots, n - 1$. Thus,
the residuals from [19.2.46] have the same asymptotic properties as the residuals
from OLS estimation of

$$y_{1t}^* = \alpha^* + \gamma_2^* y_{2t}^* + \gamma_3^* y_{3t}^* + \cdots + \gamma_{n-1}^* y_{n-1,t}^* + \gamma_n^* \delta_n t + u_t. \qquad \text{[19.2.47]}$$

The appropriate critical values for statistics constructed when $\hat{u}_t$ denotes the residual
from OLS estimation of [19.2.42] can therefore be calculated from those for an
OLS regression of an $I(1)$ variable on a constant, $(n - 2)$ other $I(1)$ variables, and
a time trend. The appropriate critical values are tabulated under the heading Case
3 in Tables B.8 and B.9.


Of course, we could instead imagine including a time trend directly in the
regression, as in

$$y_{1t} = \alpha + \gamma_2 y_{2t} + \gamma_3 y_{3t} + \cdots + \gamma_n y_{nt} + \delta t + u_t. \qquad \text{[19.2.48]}$$

Since [19.2.48] is in the same form as the regression of [19.2.47], critical values for
such a regression could be found by treating this as if it were a regression involving
$(n + 1)$ variables and looking in the case 3 section of Table B.8 or B.9 for the
critical values that would be appropriate if we actually had $(n + 1)$ rather than $n$
total variables. Clearly, the specification in [19.2.42] has more power to reject a
false null hypothesis than [19.2.48], since we would use the same table of critical
values for [19.2.42] or [19.2.48] with one more degree of freedom used up by
[19.2.48]. Conceivably, we might still want to estimate the regression in the form
of [19.2.48] to cover the case when we are not sure whether any of the elements
of $\mathbf{y}_t$ have a nonzero trend or not.
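
The rules for choosing which section and row of Tables B.8 and B.9 to consult can be written down as a small lookup. This is a sketch of the bookkeeping only, under the assumptions stated in the comments; the function name `critical_value_lookup` and its arguments are hypothetical, and it returns a label rather than any critical value.

```python
def critical_value_lookup(n_regressors, constant=True, regressor_drift=False, trend=False):
    """Return (case, row), where row is the "(n - 1)" entry to read in Table B.8 or B.9.

    n_regressors    -- number of stochastic explanatory variables, n - 1
    constant        -- whether the cointegrating regression includes a constant
    regressor_drift -- whether at least one explanatory variable has nonzero drift
    trend           -- whether a time trend is included as in [19.2.48]
                       (assumed here to come with a constant)
    """
    if trend:
        # [19.2.48]: treat the trend as one extra variable and use Case 3.
        return ("Case 3", n_regressors + 1)
    if not constant:
        # [19.2.41]; the discussion above covers this case for driftless series only.
        if regressor_drift:
            raise ValueError("no-constant regression with drifting regressors is not covered")
        return ("Case 1", n_regressors)
    if regressor_drift:
        return ("Case 3", n_regressors)     # [19.2.42] with a drifting regressor
    return ("Case 2", n_regressors)         # [19.2.42], no drift

print(critical_value_lookup(1))                         # ('Case 2', 1)
print(critical_value_lookup(3, constant=False))         # ('Case 1', 3)
print(critical_value_lookup(2, regressor_drift=True))   # ('Case 3', 2)
print(critical_value_lookup(2, trend=True))             # ('Case 3', 3)
```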


Summary of Residual-Based Tests for Cointegration 


The Phillips-Ouliaris-Hansen procedure for testing for cointegration is
summarized in Table 19.1.

To illustrate this approach, consider again the purchasing power parity
example where $p_t$ is the log of the U.S. price level, $s_t$ is the log of the dollar-lira
exchange rate, and $p_t^*$ is the log of the Italian price level. We have already seen
that the vector $\mathbf{a} = (1, -1, -1)'$ does not appear to be a cointegrating vector for
$\mathbf{y}_t = (p_t, s_t, p_t^*)'$. Let us now ask whether there is any cointegrating relation among
these variables.

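The following regression was estimated by OLS for $t$ = 1973:1 to 1989:10
(standard errors in parentheses):

$$p_t = \underset{(0.37)}{2.71} + \underset{(0.012)}{0.051}\, s_t + \underset{(0.0067)}{0.5300}\, p_t^* + \hat{u}_t. \qquad \text{[19.2.49]}$$

The number of observations used to estimate [19.2.49] is $T = 202$. When the
sample residuals $\hat{u}_t$ are regressed on their own lagged values, the result is

$$\hat{u}_t = \underset{(0.01172)}{0.98331}\, \hat{u}_{t-1} + \hat{e}_t$$

$$s^2 = (T - 2)^{-1} \sum_{t=2}^{T} \hat{e}_t^2 = (0.40374)^2$$

$$\hat{c}_0 = 0.1622$$

$$\hat{c}_j = (T - 1)^{-1} \sum_{t=j+2}^{T} \hat{e}_t \hat{e}_{t-j}$$

$$\hat{\lambda}^2 = \hat{c}_0 + 2 \cdot \sum_{j=1}^{12} [1 - (j/13)]\,\hat{c}_j = 0.4082.$$

The Phillips-Ouliaris $Z_\rho$ test is

$$\begin{aligned}
Z_\rho &= (T - 1)(\hat{\rho} - 1) - (1/2)\{(T - 1) \cdot \hat{\sigma}_{\hat{\rho}} \div s\}^2 (\hat{\lambda}^2 - \hat{c}_0) \\
&= (201)(0.98331 - 1) - \tfrac{1}{2}\{(201)(0.01172) \div (0.40374)\}^2 (0.4082 - 0.1622) \\
&= -7.54.
\end{aligned}$$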
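
The arithmetic can be checked directly from the reported numbers. The short sketch below plugs the values quoted above into the formula for $Z_\rho$ in [19.2.32] and, using the same inputs, into the formula for $Z_t$ in [19.2.33]; the variable names are illustrative only.

```python
T = 202
rho_hat, se_rho = 0.98331, 0.01172   # residual autoregression estimate and standard error
s = 0.40374                          # square root of s^2
c0, lam2 = 0.1622, 0.4082            # c-hat_0 and lambda-hat squared (q = 12)

scale = (T - 1) * se_rho / s
z_rho = (T - 1) * (rho_hat - 1.0) - 0.5 * scale ** 2 * (lam2 - c0)
z_t = ((c0 / lam2) ** 0.5 * (rho_hat - 1.0) / se_rho
       - 0.5 * scale * (lam2 - c0) / lam2 ** 0.5)
print(round(z_rho, 2), round(z_t, 2))   # prints -7.54 -2.02
```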


Given the evidence of nonzero drift in the explanatory variables, this is to be
compared with the case 3 section of Table B.8. For $(n - 1) = 2$, the 5% critical
value for $Z_\rho$ is $-27.1$. Since $-7.54 > -27.1$, the null hypothesis of no cointegration
is accepted. Similarly, the Phillips-Ouliaris $Z_t$ statistic is

$$\begin{aligned}
Z_t &= (\hat{c}_0/\hat{\lambda}^2)^{1/2}(\hat{\rho} - 1)/\hat{\sigma}_{\hat{\rho}} - (1/2)\{(T - 1) \cdot \hat{\sigma}_{\hat{\rho}} \div s\}(\hat{\lambda}^2 - \hat{c}_0)/\hat{\lambda} \\
&= \{(0.1622)/(0.4082)\}^{1/2}(0.98331 - 1)/(0.01172) \\
&\qquad - \tfrac{1}{2}\{(201)(0.01172) \div (0.40374)\}(0.4082 - 0.1622)/(0.4082)^{1/2} \\
&= -2.02.
\end{aligned}$$

Comparing this with the case 3 section of Table B.9, we see that $-2.02 > -3.80$,
so that the null hypothesis of no cointegration is also accepted by this test. An
OLS regression of $\hat{u}_t$ on $\hat{u}_{t-1}$ and twelve lags of $\Delta\hat{u}_{t-j}$ produces an OLS $t$ test of
$\rho = 1$ of $-2.73$, which is again above $-3.80$. We thus find little evidence that $p_t$,
$s_t$, and $p_t^*$ are cointegrated. Indeed, the regression [19.2.49] displays the classic
symptoms of a spurious regression: the estimated standard errors are small relative
to the coefficient estimates, and the estimated first-order autocorrelation of the
residuals is near unity.

As a second example, Figure 19.5 plots 100 times the logs of real quarterly
aggregate personal disposable income ($y_t$) and personal consumption expenditures
($c_t$) for the United States over 1947:I to 1989:III. In a regression of $y_t$ on a constant,
a time trend, $y_{t-1}$, and $\Delta y_{t-j}$ for $j = 1, 2, \ldots, 6$, the OLS $t$ test that the coefficient
on $y_{t-1}$ is unity is $-1.28$. Similarly, in a regression of $c_t$ on a constant, a time
trend, $c_{t-1}$, and $\Delta c_{t-j}$ for $j = 1, 2, \ldots, 6$, the OLS $t$ test that the coefficient on
$c_{t-1}$ is unity is $-1.88$. Thus, both processes might well be described as $I(1)$ with
positive drift.

The OLS estimate of the cointegrating relation is

$$c_t = \underset{(2.35)}{0.67} + \underset{(0.0032)}{0.9865}\, y_t + \hat{u}_t. \qquad \text{[19.2.50]}$$

A first-order autoregression fitted to the residuals produces

$$\hat{u}_t = \underset{(0.048)}{0.782}\, \hat{u}_{t-1} + \hat{e}_t,$$

for which the corresponding $Z_\rho$ and $Z_t$ statistics for $q = 6$ are $-32.0$ and $-4.28$.
Since there is again ample evidence that $y_t$ has positive drift, these are to be
compared with the case 3 sections of Tables B.8 and B.9, respectively. Since
$-32.0 < -21.5$ and $-4.28 < -3.42$, in each case the null hypothesis of no
cointegration is rejected at the 5% level. Thus consumption and income appear to
be cointegrated.

FIGURE 19.5 One hundred times the log of personal consumption expenditures
($c_t$) and personal disposable income ($y_t$) for the United States in billions of 1982
dollars, quarterly, 1947-89.


TABLE 19.1
Summary of Phillips-Ouliaris-Hansen Tests for Cointegration

Case 1:
Estimated cointegrating regression:

$$y_{1t} = \gamma_2 y_{2t} + \gamma_3 y_{3t} + \cdots + \gamma_n y_{nt} + u_t$$

True process for $\mathbf{y}_t = (y_{1t}, y_{2t}, \ldots, y_{nt})'$:

$$\Delta\mathbf{y}_t = \sum_{s=0}^{\infty} \boldsymbol{\Psi}_s \boldsymbol{\varepsilon}_{t-s}$$

$Z_\rho$ has the same asymptotic distribution as the variable described under the
heading Case 1 in Table B.8.

$Z_t$ and the augmented Dickey-Fuller $t$ test have the same asymptotic
distribution as the variable described under Case 1 in Table B.9.

Case 2:
Estimated cointegrating regression:

$$y_{1t} = \alpha + \gamma_2 y_{2t} + \gamma_3 y_{3t} + \cdots + \gamma_n y_{nt} + u_t$$

True process for $\mathbf{y}_t = (y_{1t}, y_{2t}, \ldots, y_{nt})'$:

$$\Delta\mathbf{y}_t = \sum_{s=0}^{\infty} \boldsymbol{\Psi}_s \boldsymbol{\varepsilon}_{t-s}$$

$Z_\rho$ has the same asymptotic distribution as the variable described under Case
2 in Table B.8.

$Z_t$ and the augmented Dickey-Fuller $t$ test have the same asymptotic
distribution as the variable described under Case 2 in Table B.9.

Case 3:
Estimated cointegrating regression:

$$y_{1t} = \alpha + \gamma_2 y_{2t} + \gamma_3 y_{3t} + \cdots + \gamma_n y_{nt} + u_t$$

True process for $\mathbf{y}_t = (y_{1t}, y_{2t}, \ldots, y_{nt})'$:

$$\Delta\mathbf{y}_t = \boldsymbol{\delta} + \sum_{s=0}^{\infty} \boldsymbol{\Psi}_s \boldsymbol{\varepsilon}_{t-s}$$

with at least one element of $\delta_2, \delta_3, \ldots, \delta_n$ nonzero.

$Z_\rho$ has the same asymptotic distribution as the variable described under Case
3 in Table B.8.

$Z_t$ and the augmented Dickey-Fuller $t$ test have the same asymptotic
distribution as the variable described under Case 3 in Table B.9.

Notes to Table 19.1

Estimated cointegrating regression indicates the form in which the regression that could describe
the cointegrating relation is estimated, using observations $t = 1, 2, \ldots, T$.

True process describes the null hypothesis under which the distribution is calculated. In each
case, $\boldsymbol{\varepsilon}_t$ is assumed to be i.i.d. with mean zero, positive definite variance-covariance matrix, and finite
fourth moments, and the sequence $\{s \cdot \boldsymbol{\Psi}_s\}_{s=0}^{\infty}$ is absolutely summable. The matrix $\boldsymbol{\Psi}(1)$ is assumed to
be nonsingular, meaning that the vector $\mathbf{y}_t$ is not cointegrated under the null hypothesis. If the test
statistic is below the indicated critical value (that is, if $Z_\rho$, $Z_t$, or $t$ is negative and sufficiently large in
absolute value), then the null hypothesis of no cointegration is rejected.

$Z_\rho$ is the following statistic:

$$Z_\rho \equiv (T - 1)(\hat{\rho}_T - 1) - (1/2)\{(T - 1)^2 \cdot \hat{\sigma}_{\hat{\rho}_T}^2 \div s_T^2\}(\hat{\lambda}_T^2 - \hat{c}_{0,T}),$$

where $\hat{\rho}_T$ is the estimate of $\rho$ based on OLS estimation of $\hat{u}_t = \rho\,\hat{u}_{t-1} + e_t$ for $\hat{u}_t$ the OLS sample residual
from the estimated regression. Here,

$$s_T^2 = (T - 2)^{-1} \sum_{t=2}^{T} \hat{e}_t^2,$$

where $\hat{e}_t = \hat{u}_t - \hat{\rho}_T \hat{u}_{t-1}$ is the sample residual from the autoregression describing $\hat{u}_t$ and $\hat{\sigma}_{\hat{\rho}_T}$ is the
standard error for $\hat{\rho}_T$ as calculated by the usual OLS formula:

$$\hat{\sigma}_{\hat{\rho}_T}^2 = s_T^2 \div \sum_{t=2}^{T} \hat{u}_{t-1}^2.$$

Also,

$$\hat{c}_{j,T} = (T - 1)^{-1} \sum_{t=j+2}^{T} \hat{e}_t \hat{e}_{t-j}$$

$$\hat{\lambda}_T^2 = \hat{c}_{0,T} + 2 \cdot \sum_{j=1}^{q} [1 - j/(q + 1)]\,\hat{c}_{j,T}.$$

$Z_t$ is the following statistic:

$$Z_t \equiv (\hat{c}_{0,T}/\hat{\lambda}_T^2)^{1/2}(\hat{\rho}_T - 1)/\hat{\sigma}_{\hat{\rho}_T} - (1/2)(\hat{\lambda}_T^2 - \hat{c}_{0,T})(1/\hat{\lambda}_T)\{(T - 1) \cdot \hat{\sigma}_{\hat{\rho}_T} \div s_T\}.$$

Augmented Dickey-Fuller t statistic is the OLS $t$ test of the null hypothesis that $\rho = 1$ in the
regression

$$\hat{u}_t = \zeta_1 \Delta\hat{u}_{t-1} + \zeta_2 \Delta\hat{u}_{t-2} + \cdots + \zeta_{p-1} \Delta\hat{u}_{t-p+1} + \rho\,\hat{u}_{t-1} + e_t.$$


Other Tests for Cointegration 


The tests that have been discussed in this section are based on the residuals
from an OLS regression of $y_{1t}$ on $(y_{2t}, y_{3t}, \ldots, y_{nt})$. Since these are not the same
as the residuals from a regression of $y_{2t}$ on $(y_{1t}, y_{3t}, \ldots, y_{nt})$, the tests can give
different answers depending on which variable is labeled $y_1$. Important tests for
cointegration that are invariant to the ordering of variables are the full-information
maximum likelihood test of Johansen (1988, 1991) and the related tests of Stock
and Watson (1988) and Ahn and Reinsel (1990). These will be discussed in Chapter
20. Other useful tests for cointegration have been proposed by Phillips and Ouliaris
(1990), Park, Ouliaris, and Choi (1988), Stock (1990), and Hansen (1990).


19.3. Testing Hypotheses About the Cointegrating Vector 


The previous section described some ways to test whether a vector $\mathbf{y}_t$ is cointegrated.
It was noted that if $\mathbf{y}_t$ is cointegrated, then a consistent estimate of the cointegrating
vector can be obtained by OLS. This section explores further the distribution theory
of this estimate and proposes several alternative estimates that simplify hypothesis
testing.


Distribution of the OLS Estimate for a Special Case 

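Let $y_{1t}$ be a scalar and $\mathbf{y}_{2t}$ be a $(g \times 1)$ vector satisfying

$$y_{1t} = \alpha + \boldsymbol{\gamma}'\mathbf{y}_{2t} + z_t^* \qquad \text{[19.3.1]}$$

$$\mathbf{y}_{2t} = \mathbf{y}_{2,t-1} + \mathbf{u}_{2t}. \qquad \text{[19.3.2]}$$

If $y_{1t}$ and $\mathbf{y}_{2t}$ are both $I(1)$ but $z_t^*$ and $\mathbf{u}_{2t}$ are $I(0)$, then, for $n \equiv (g + 1)$, the
$n$-dimensional vector $(y_{1t}, \mathbf{y}_{2t}')'$ is cointegrated with cointegrating relation [19.3.1].

Consider the special case of a Gaussian system for which $\mathbf{y}_{2t}$ follows a random
walk and for which $z_t^*$ is white noise and uncorrelated with $\mathbf{u}_{2\tau}$ for all $t$ and $\tau$:

$$\begin{bmatrix} z_t^* \\ \mathbf{u}_{2t} \end{bmatrix} \sim \text{i.i.d. } N\left( \begin{bmatrix} 0 \\ \mathbf{0} \end{bmatrix}, \begin{bmatrix} \sigma_1^2 & \mathbf{0}' \\ \mathbf{0} & \boldsymbol{\Omega}_{22} \end{bmatrix} \right). \qquad \text{[19.3.3]}$$

Then [19.3.1] describes a regression in which the explanatory variables ($\mathbf{y}_{2t}$) are
independent of the error term ($z_\tau^*$) for all $t$ and $\tau$. The regression thus satisfies
Assumption 8.2 in Chapter 8. There it was seen that conditional on $(\mathbf{y}_{21}, \mathbf{y}_{22}, \ldots,
\mathbf{y}_{2T})$, the OLS estimates have a Gaussian distribution:

$$\begin{aligned}
\left. \begin{bmatrix} (\hat{\alpha}_T - \alpha) \\ (\hat{\boldsymbol{\gamma}}_T - \boldsymbol{\gamma}) \end{bmatrix} \right| (\mathbf{y}_{21}, \mathbf{y}_{22}, \ldots, \mathbf{y}_{2T}) &= \begin{bmatrix} T & \Sigma \mathbf{y}_{2t}' \\ \Sigma \mathbf{y}_{2t} & \Sigma \mathbf{y}_{2t}\mathbf{y}_{2t}' \end{bmatrix}^{-1} \begin{bmatrix} \Sigma z_t^* \\ \Sigma \mathbf{y}_{2t} z_t^* \end{bmatrix} \\
&\sim N\left( \begin{bmatrix} 0 \\ \mathbf{0} \end{bmatrix}, \sigma_1^2 \begin{bmatrix} T & \Sigma \mathbf{y}_{2t}' \\ \Sigma \mathbf{y}_{2t} & \Sigma \mathbf{y}_{2t}\mathbf{y}_{2t}' \end{bmatrix}^{-1} \right), \qquad \text{[19.3.4]}
\end{aligned}$$

where $\Sigma$ indicates summation over $t$ from 1 to $T$.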

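Recall further from Chapter 8 that this conditional Gaussian distribution is
all that is needed to justify small-sample application of the usual OLS $t$ or $F$ tests.
Consider a hypothesis test involving $m$ restrictions on $\alpha$ and $\boldsymbol{\gamma}$ of the form

$$\mathbf{R}_\alpha \alpha + \mathbf{R}_\gamma \boldsymbol{\gamma} = \mathbf{r},$$

where $\mathbf{R}_\alpha$ and $\mathbf{r}$ are known $(m \times 1)$ vectors and $\mathbf{R}_\gamma$ is a known $(m \times g)$ matrix
describing the restrictions. The Wald form of the OLS $F$ test of the null hypothesis
is

$$\begin{aligned}
(\mathbf{R}_\alpha \hat{\alpha}_T + \mathbf{R}_\gamma \hat{\boldsymbol{\gamma}}_T - \mathbf{r})' &\left\{ s_T^2\, [\mathbf{R}_\alpha \;\; \mathbf{R}_\gamma] \begin{bmatrix} T & \Sigma \mathbf{y}_{2t}' \\ \Sigma \mathbf{y}_{2t} & \Sigma \mathbf{y}_{2t}\mathbf{y}_{2t}' \end{bmatrix}^{-1} \begin{bmatrix} \mathbf{R}_\alpha' \\ \mathbf{R}_\gamma' \end{bmatrix} \right\}^{-1} \\
&\times (\mathbf{R}_\alpha \hat{\alpha}_T + \mathbf{R}_\gamma \hat{\boldsymbol{\gamma}}_T - \mathbf{r}) \div m, \qquad \text{[19.3.5]}
\end{aligned}$$

where

$$s_T^2 \equiv (T - n)^{-1} \sum_{t=1}^{T} (y_{1t} - \hat{\alpha}_T - \hat{\boldsymbol{\gamma}}_T'\mathbf{y}_{2t})^2.$$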
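Result [19.3.4] implies that conditional on $(\mathbf{y}_{21}, \mathbf{y}_{22}, \ldots, \mathbf{y}_{2T})$, under the null
hypothesis the vector $(\mathbf{R}_\alpha \hat{\alpha}_T + \mathbf{R}_\gamma \hat{\boldsymbol{\gamma}}_T - \mathbf{r})$ has a Gaussian distribution with mean
$\mathbf{0}$ and variance

$$\sigma_1^2\, [\mathbf{R}_\alpha \;\; \mathbf{R}_\gamma] \begin{bmatrix} T & \Sigma \mathbf{y}_{2t}' \\ \Sigma \mathbf{y}_{2t} & \Sigma \mathbf{y}_{2t}\mathbf{y}_{2t}' \end{bmatrix}^{-1} \begin{bmatrix} \mathbf{R}_\alpha' \\ \mathbf{R}_\gamma' \end{bmatrix}.$$

It follows that conditional on $(\mathbf{y}_{21}, \mathbf{y}_{22}, \ldots, \mathbf{y}_{2T})$, the term

$$\begin{aligned}
(\mathbf{R}_\alpha \hat{\alpha}_T + \mathbf{R}_\gamma \hat{\boldsymbol{\gamma}}_T - \mathbf{r})' &\left\{ \sigma_1^2\, [\mathbf{R}_\alpha \;\; \mathbf{R}_\gamma] \begin{bmatrix} T & \Sigma \mathbf{y}_{2t}' \\ \Sigma \mathbf{y}_{2t} & \Sigma \mathbf{y}_{2t}\mathbf{y}_{2t}' \end{bmatrix}^{-1} \begin{bmatrix} \mathbf{R}_\alpha' \\ \mathbf{R}_\gamma' \end{bmatrix} \right\}^{-1} \\
&\times (\mathbf{R}_\alpha \hat{\alpha}_T + \mathbf{R}_\gamma \hat{\boldsymbol{\gamma}}_T - \mathbf{r}) \qquad \text{[19.3.6]}
\end{aligned}$$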


is a quadratic form in a Gaussian vector. Proposition 8,1 establishes that conditional 
on (¥o1, Yo.» +» » Yar), the magnitude in [19.3.6] has a x?(m) distribution. Thus, 
conditional on (y2;, Y22,. + - » Yar), the OLS F test [19.3.5] could be viewed as the 
ratio of a y?(m) variable to the independent y?(T — n) variable (T — n)s3/o?, 
with numerator and denominator each divided by its degree of freedom. The OLS 
F test thus has an exact F(m, T — n) conditional distribution, Since this is the 
same distribution for all realizations of (y2,, Yo, . . . , Yar), it follows that [19.3.5] 
has an unconditional F(m, T — n) distribution as well. Hence, despite the J(1) 
regressors and complications of cointegration, the correct approach for this example 
would be to estimate [19.3.1] by OLS and use standard t or F statistics to test any 
hypotheses about the cointegrating vector. No special procedures are needed to 
estimate the cointegrating vector, and no unusual critical values need be consulted 
to test a hypothesis about its value. 
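As a minimal sketch of the calculation in [19.3.5], the following Python function computes the Wald form of the OLS $F$ statistic for restrictions $Rb = r$ on a regression whose coefficient vector stacks $(\alpha, \gamma')'$. The function name `wald_f` and its interface are illustrative assumptions rather than anything defined in the text. With a single restriction the statistic should equal the square of the usual OLS $t$ statistic, which offers a simple check.

```python
import numpy as np

def wald_f(X, y, R, r):
    """Wald form of the OLS F statistic for H0: R b = r (hypothetical helper).

    X : (T x k) regressor matrix, e.g. a column of ones followed by y_2t
    y : (T,) dependent variable, e.g. y_1t
    R : (m x k) restriction matrix, r : (m,) restriction values
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    R = np.atleast_2d(np.asarray(R, dtype=float))
    r = np.atleast_1d(np.asarray(r, dtype=float))
    T, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y                 # OLS coefficients
    resid = y - X @ b
    s2 = resid @ resid / (T - k)          # s_T^2 with T - n degrees of freedom
    d = R @ b - r                         # deviation from the null
    V = s2 * (R @ XtX_inv @ R.T)          # estimated variance of R b
    m = R.shape[0]
    F = float(d @ np.linalg.solve(V, d) / m)
    return F, b, s2
```

Under the assumptions of [19.3.1] through [19.3.3], the value returned as `F` would be compared with the $F(m, T - n)$ tables.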

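We now seek to make an analogous statement in terms of the corresponding asymptotic distributions. To do so it will be helpful to rescale the results in [19.3.4] and [19.3.5] so that they define sequences of statistics with nondegenerate asymptotic distributions. If [19.3.4] is premultiplied by the matrix

$$
\begin{bmatrix} T^{1/2} & 0' \\ 0 & T\cdot I_g \end{bmatrix},
$$

the implication is that the distribution of the OLS estimates conditional on $(y_{21}, y_{22}, \ldots, y_{2T})$ is given by

$$
\begin{aligned}
\begin{bmatrix} T^{1/2}(\hat{\alpha}_T - \alpha) \\ T(\hat{\gamma}_T - \gamma) \end{bmatrix}
&\sim N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix},\;
\sigma_1^2 \left( \begin{bmatrix} T^{-1/2} & 0' \\ 0 & T^{-1}I_g \end{bmatrix}
\begin{bmatrix} T & \sum y_{2t}' \\ \sum y_{2t} & \sum y_{2t}y_{2t}' \end{bmatrix}
\begin{bmatrix} T^{-1/2} & 0' \\ 0 & T^{-1}I_g \end{bmatrix} \right)^{-1} \right) \\
&= N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix},\;
\sigma_1^2 \begin{bmatrix} 1 & T^{-3/2}\sum y_{2t}' \\ T^{-3/2}\sum y_{2t} & T^{-2}\sum y_{2t}y_{2t}' \end{bmatrix}^{-1} \right).
\end{aligned}
\qquad [19.3.7]
$$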


To analyze the asymptotic distribution, notice that [19.3.1] through [19.3.3] are a special case of the system analyzed in Proposition 19.2 with $\Psi^*(L) = I_n$ and

$$
P = \begin{bmatrix} \sigma_1 & 0' \\ 0 & P_{22} \end{bmatrix},
$$

where $P_{22}$ is the Cholesky factor of $\Omega_{22}$:

$$
\Omega_{22} = P_{22}P_{22}'.
$$

For this special case,

$$
\Psi^*(1)\cdot P = \begin{bmatrix} \sigma_1 & 0' \\ 0 & P_{22} \end{bmatrix}.
\qquad [19.3.8]
$$


19,3. Testing Hypotheses About the Cointegrating Vector 603 


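The terms $\lambda_1^{*\prime}$ and $\Lambda_2^*$ referred to in Proposition 19.2 would then be given by

$$
\underset{(1\times n)}{\lambda_1^{*\prime}} = \big[\, \underset{(1\times 1)}{\sigma_1} \;\; \underset{(1\times g)}{0'} \,\big],
\qquad
\underset{(g\times n)}{\Lambda_2^*} = \big[\, \underset{(g\times 1)}{0} \;\; \underset{(g\times g)}{P_{22}} \,\big].
$$

Thus, result [19.2.13] of Proposition 19.2 establishes that

$$
\begin{aligned}
\begin{bmatrix} T^{1/2}(\hat{\alpha}_T - \alpha) \\ T(\hat{\gamma}_T - \gamma) \end{bmatrix}
&= \begin{bmatrix} 1 & T^{-3/2}\sum y_{2t}' \\ T^{-3/2}\sum y_{2t} & T^{-2}\sum y_{2t}y_{2t}' \end{bmatrix}^{-1}
\begin{bmatrix} T^{-1/2}\sum z_t^* \\ T^{-1}\sum y_{2t}z_t^* \end{bmatrix} \\
&\xrightarrow{L}
\begin{bmatrix} 1 & \left\{\int [W(r)]'\,dr\right\}[0 \;\; P_{22}]' \\
[0 \;\; P_{22}]\int W(r)\,dr & [0 \;\; P_{22}]\left\{\int [W(r)]\cdot[W(r)]'\,dr\right\}[0 \;\; P_{22}]' \end{bmatrix}^{-1} \\
&\qquad \times
\begin{bmatrix} [\sigma_1 \;\; 0']\,W(1) \\
[0 \;\; P_{22}]\left\{\int [W(r)]\,[dW(r)]'\right\}\begin{bmatrix} \sigma_1 \\ 0 \end{bmatrix} \end{bmatrix},
\end{aligned}
\qquad [19.3.9]
$$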


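where the integral sign indicates integration over $r$ from 0 to 1. If the $n$-dimensional standard Brownian motion $W(r)$ is partitioned as

$$
\underset{(n\times 1)}{W(r)} = \begin{bmatrix} \underset{(1\times 1)}{W_1(r)} \\ \underset{(g\times 1)}{W_2(r)} \end{bmatrix},
$$

then [19.3.9] can be written

$$
\begin{aligned}
\begin{bmatrix} T^{1/2}(\hat{\alpha}_T - \alpha) \\ T(\hat{\gamma}_T - \gamma) \end{bmatrix}
&\xrightarrow{L}
\begin{bmatrix} 1 & \left\{\int [W_2(r)]'\,dr\right\}P_{22}' \\
P_{22}\int W_2(r)\,dr & P_{22}\left\{\int [W_2(r)]\cdot[W_2(r)]'\,dr\right\}P_{22}' \end{bmatrix}^{-1}
\begin{bmatrix} \sigma_1 W_1(1) \\ P_{22}\left\{\int [W_2(r)]\,dW_1(r)\right\}\sigma_1 \end{bmatrix} \\
&\equiv \sigma_1 \begin{bmatrix} \nu_1 \\ \nu_2 \end{bmatrix},
\end{aligned}
\qquad [19.3.10]
$$

where

$$
\begin{bmatrix} \nu_1 \\ \nu_2 \end{bmatrix}
\equiv
\begin{bmatrix} 1 & \left\{\int [W_2(r)]'\,dr\right\}P_{22}' \\
P_{22}\int W_2(r)\,dr & P_{22}\left\{\int [W_2(r)]\cdot[W_2(r)]'\,dr\right\}P_{22}' \end{bmatrix}^{-1}
\begin{bmatrix} W_1(1) \\ P_{22}\left\{\int [W_2(r)]\,dW_1(r)\right\} \end{bmatrix}.
\qquad [19.3.11]
$$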



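Since $W_1(\cdot)$ is independent of $W_2(\cdot)$, the distribution of $(\nu_1, \nu_2')'$ conditional on $W_2(\cdot)$ is found by treating $W_2(r)$ as a deterministic function of $r$ and leaving the process $W_1(\cdot)$ unaffected. Then $\int [W_2(r)]\,dW_1(r)$ has a simple Gaussian distribution, and [19.3.11] describes a Gaussian vector. In particular, the exact finite-sample result for Gaussian disturbances [19.3.7] implied that

$$
\begin{aligned}
\begin{bmatrix} T^{1/2}(\hat{\alpha}_T - \alpha) \\ T(\hat{\gamma}_T - \gamma) \end{bmatrix}
&= \begin{bmatrix} 1 & T^{-3/2}\sum y_{2t}' \\ T^{-3/2}\sum y_{2t} & T^{-2}\sum y_{2t}y_{2t}' \end{bmatrix}^{-1}
\begin{bmatrix} T^{-1/2}\sum z_t^* \\ T^{-1}\sum y_{2t}z_t^* \end{bmatrix} \\
&\sim N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix},\;
\sigma_1^2 \begin{bmatrix} 1 & T^{-3/2}\sum y_{2t}' \\ T^{-3/2}\sum y_{2t} & T^{-2}\sum y_{2t}y_{2t}' \end{bmatrix}^{-1} \right).
\end{aligned}
$$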


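Comparing this with the limiting distribution [19.3.10], it appears that the vector $(\nu_1, \nu_2')'$ has a distribution conditional on $W_2(\cdot)$ that could be described as

$$
\left.\begin{bmatrix} \nu_1 \\ \nu_2 \end{bmatrix}\right| W_2(\cdot)
\sim N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix},\;
\begin{bmatrix} 1 & \left\{\int [W_2(r)]'\,dr\right\}P_{22}' \\
P_{22}\int W_2(r)\,dr & P_{22}\left\{\int [W_2(r)]\cdot[W_2(r)]'\,dr\right\}P_{22}' \end{bmatrix}^{-1} \right).
\qquad [19.3.12]
$$

Expression [19.3.12] allows the argument that was used to motivate the usual OLS $t$ and $F$ tests on the system of [19.3.1] and [19.3.2] with Gaussian disturbances satisfying [19.3.3] to give an asymptotic justification for these same tests in a system with non-Gaussian disturbances whose means and autocovariances are as assumed in [19.3.3]. Consider for illustration a hypothesis that involves only the cointegrating vector, so that $R_\alpha = 0$. Then, under the null hypothesis, $m$ times the $F$ test in [19.3.5] becomes

$$
\begin{aligned}
m\cdot F_T
&= [R_\gamma(\hat{\gamma}_T - \gamma)]'
\left\{ s_T^2\,[0 \;\; R_\gamma]
\begin{bmatrix} T & \sum y_{2t}' \\ \sum y_{2t} & \sum y_{2t}y_{2t}' \end{bmatrix}^{-1}
\begin{bmatrix} 0' \\ R_\gamma' \end{bmatrix} \right\}^{-1}
[R_\gamma(\hat{\gamma}_T - \gamma)] \\
&= [R_\gamma\cdot T(\hat{\gamma}_T - \gamma)]'
\left\{ s_T^2\,[0 \;\; R_\gamma]
\left( \begin{bmatrix} T^{-1/2} & 0' \\ 0 & T^{-1}I_g \end{bmatrix}
\begin{bmatrix} T & \sum y_{2t}' \\ \sum y_{2t} & \sum y_{2t}y_{2t}' \end{bmatrix}
\begin{bmatrix} T^{-1/2} & 0' \\ 0 & T^{-1}I_g \end{bmatrix} \right)^{-1}
\begin{bmatrix} 0' \\ R_\gamma' \end{bmatrix} \right\}^{-1}
[R_\gamma\cdot T(\hat{\gamma}_T - \gamma)] \\
&\xrightarrow{L} [R_\gamma\sigma_1\nu_2]'\,(s_T^2)^{-1}
\left\{ [0 \;\; R_\gamma]
\begin{bmatrix} 1 & \left\{\int [W_2(r)]'\,dr\right\}P_{22}' \\
P_{22}\int W_2(r)\,dr & P_{22}\left\{\int [W_2(r)]\cdot[W_2(r)]'\,dr\right\}P_{22}' \end{bmatrix}^{-1}
\begin{bmatrix} 0' \\ R_\gamma' \end{bmatrix} \right\}^{-1}
[R_\gamma\sigma_1\nu_2] \\
&= (\sigma_1^2/s_T^2)\,[R_\gamma\nu_2]'
\left\{ [0 \;\; R_\gamma]
\begin{bmatrix} 1 & \left\{\int [W_2(r)]'\,dr\right\}P_{22}' \\
P_{22}\int W_2(r)\,dr & P_{22}\left\{\int [W_2(r)]\cdot[W_2(r)]'\,dr\right\}P_{22}' \end{bmatrix}^{-1}
\begin{bmatrix} 0' \\ R_\gamma' \end{bmatrix} \right\}^{-1}
[R_\gamma\nu_2].
\end{aligned}
\qquad [19.3.13]
$$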


Result [19.3.12] implies that conditional on $W_2(\cdot)$, the vector $R_\gamma\nu_2$ has a Gaussian distribution with mean 0 and variance

$$
[0 \;\; R_\gamma]
\begin{bmatrix} 1 & \left\{\int [W_2(r)]'\,dr\right\}P_{22}' \\
P_{22}\int W_2(r)\,dr & P_{22}\left\{\int [W_2(r)]\cdot[W_2(r)]'\,dr\right\}P_{22}' \end{bmatrix}^{-1}
\begin{bmatrix} 0' \\ R_\gamma' \end{bmatrix}.
$$

Since $s_T^2$ provides a consistent estimate of $\sigma_1^2$, the limiting distribution of $m\cdot F_T$ conditional on $W_2(\cdot)$ is thus $\chi^2(m)$, and so the unconditional distribution is $\chi^2(m)$ as well. This means that OLS $t$ or $F$ tests involving the cointegrating vector have their standard asymptotic Gaussian or $\chi^2$ distributions.

It is also straightforward to adapt the methods in Section 16.3 to show that the OLS $\chi^2$ test of a hypothesis involving just $\alpha$, or that for a joint hypothesis involving both $\alpha$ and $\gamma$, also has a limiting $\chi^2$ distribution.

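The analysis to this point applies in the special case when $y_{1t}$ and $y_{2t}$ follow random walks. The analysis is easily extended to allow for serial correlation in $z_t^*$ or $u_{2t}$, as long as the critical condition that $z_t^*$ is uncorrelated with $u_{2\tau}$ for all $t$ and $\tau$ is maintained. In particular, suppose that the dynamic process for $(z_t^*, u_{2t}')'$ is given by

$$
\begin{bmatrix} z_t^* \\ u_{2t} \end{bmatrix} = \Psi^*(L)\varepsilon_t,
$$

with $\{s\cdot\Psi_s^*\}_{s=0}^{\infty}$ absolutely summable, $E(\varepsilon_t) = 0$, $E(\varepsilon_t\varepsilon_\tau') = PP'$ if $t = \tau$ and 0 otherwise, and fourth moments of $\varepsilon_t$ finite. In order for $z_t^*$ to be uncorrelated with $u_{2\tau}$ for all $t$ and $\tau$, both $\Psi^*(L)$ and $P$ must be block-diagonal:

$$
\Psi^*(L) = \begin{bmatrix} \psi_{11}^*(L) & 0' \\ 0 & \Psi_{22}^*(L) \end{bmatrix},
\qquad
P = \begin{bmatrix} \sigma_1 & 0' \\ 0 & P_{22} \end{bmatrix},
$$

implying that the matrix $\Psi^*(1)\cdot P$ is also block-diagonal:

$$
\Psi^*(1)\cdot P
= \begin{bmatrix} \sigma_1\psi_{11}^*(1) & 0' \\ 0 & \Psi_{22}^*(1)P_{22} \end{bmatrix}
\equiv \begin{bmatrix} \lambda_1^* & 0' \\ 0 & \Lambda_{22}^* \end{bmatrix}.
\qquad [19.3.14]
$$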


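Noting the parallel between [19.3.14] and [19.3.8], it is easy to confirm that if $\lambda_1^* \neq 0$ and the rows of $\Lambda_{22}^*$ are linearly independent, then the analysis of [19.3.10] continues to hold, with $\sigma_1$ replaced by $\lambda_1^*$ and $P_{22}$ replaced by $\Lambda_{22}^*$:

$$
\begin{aligned}
\begin{bmatrix} T^{1/2}(\hat{\alpha}_T - \alpha) \\ T(\hat{\gamma}_T - \gamma) \end{bmatrix}
&\xrightarrow{L}
\begin{bmatrix} 1 & \left\{\int [W_2(r)]'\,dr\right\}\Lambda_{22}^{*\prime} \\
\Lambda_{22}^*\int W_2(r)\,dr & \Lambda_{22}^*\left\{\int [W_2(r)]\cdot[W_2(r)]'\,dr\right\}\Lambda_{22}^{*\prime} \end{bmatrix}^{-1} \\
&\qquad \times
\begin{bmatrix} \lambda_1^* W_1(1) \\ \Lambda_{22}^*\left\{\int [W_2(r)]\,dW_1(r)\right\}\lambda_1^* \end{bmatrix}.
\end{aligned}
\qquad [19.3.15]
$$

Conditional on $W_2(\cdot)$, this again describes a Gaussian vector with mean zero and variance

$$
(\lambda_1^*)^2
\begin{bmatrix} 1 & \left\{\int [W_2(r)]'\,dr\right\}\Lambda_{22}^{*\prime} \\
\Lambda_{22}^*\int W_2(r)\,dr & \Lambda_{22}^*\left\{\int [W_2(r)]\cdot[W_2(r)]'\,dr\right\}\Lambda_{22}^{*\prime} \end{bmatrix}^{-1}.
$$

The same calculations as in [19.3.13] further indicate that $m$ times the OLS $F$ test of $m$ restrictions involving $\alpha$ or $\gamma$ converges to $(\lambda_1^*)^2/s_T^2$ times a variable that is $\chi^2(m)$ conditional on $W_2(\cdot)$. Since this distribution does not depend on $W_2(\cdot)$, the unconditional distribution is also $[(\lambda_1^*)^2/s_T^2]\cdot\chi^2(m)$.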

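Note that the OLS estimate $s_T^2$ provides a consistent estimate of the variance of $z_t^*$:

$$
s_T^2 = (T - n)^{-1}\sum_{t=1}^{T} (y_{1t} - \hat{\alpha}_T - \hat{\gamma}_T'y_{2t})^2 \xrightarrow{p} E(z_t^*)^2.
$$

However, if $z_t^*$ is serially correlated, this is not the same magnitude as $(\lambda_1^*)^2$. Fortunately, this is simple to correct for. For example, $s_T^2$ in the usual formula for the $F$ test [19.3.5] could be replaced with

$$
(\hat{\lambda}_{1,T}^*)^2 = \hat{c}_{0,T} + 2\cdot\sum_{j=1}^{q} [1 - j/(q + 1)]\,\hat{c}_{j,T}
\qquad [19.3.16]
$$

for

$$
\hat{c}_{j,T} = T^{-1}\sum_{t=j+1}^{T} \hat{u}_t\hat{u}_{t-j},
\qquad [19.3.17]
$$

with $\hat{u}_t = (y_{1t} - \hat{\alpha}_T - \hat{\gamma}_T'y_{2t})$ the sample residual resulting from OLS estimation of [19.3.1]. If $q \to \infty$ but $q/T \to 0$, then $\hat{\lambda}_{1,T}^* \xrightarrow{p} \lambda_1^*$. It then follows that the test statistic given by

$$
(R_\alpha\hat{\alpha}_T + R_\gamma\hat{\gamma}_T - r)'
\left\{ (\hat{\lambda}_{1,T}^*)^2\,[R_\alpha \;\; R_\gamma]
\begin{bmatrix} T & \sum y_{2t}' \\ \sum y_{2t} & \sum y_{2t}y_{2t}' \end{bmatrix}^{-1}
\begin{bmatrix} R_\alpha' \\ R_\gamma' \end{bmatrix} \right\}^{-1}
(R_\alpha\hat{\alpha}_T + R_\gamma\hat{\gamma}_T - r)
\qquad [19.3.18]
$$

has an asymptotic $\chi^2(m)$ distribution.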
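The Bartlett-weighted sum in [19.3.16] and [19.3.17] can be sketched in a few lines of Python. The function name `bartlett_long_run_variance` is a hypothetical label chosen for illustration; the input `u` is assumed to hold the OLS residuals from [19.3.1] and `q` the number of autocovariances used.

```python
import numpy as np

def bartlett_long_run_variance(u, q):
    """Bartlett-weighted long-run variance, as in [19.3.16]-[19.3.17].

    u : (T,) array of regression residuals
    q : number of sample autocovariances to include
    """
    u = np.asarray(u, dtype=float)
    T = len(u)

    def c(j):
        # j-th sample autocovariance, divided by T (not T - j)
        return u[j:] @ u[:T - j] / T

    weighted = sum((1.0 - j / (q + 1.0)) * c(j) for j in range(1, q + 1))
    return c(0) + 2.0 * weighted
```

Setting `q = 0` simply returns the average squared residual, which differs from $s_T^2$ only in the degrees-of-freedom divisor.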




The difficulties with nonstandard distributions for hypothesis tests about the cointegrating vector are thus due to the possibility of nonzero correlations between $z_t^*$ and $u_{2\tau}$. The basic approach to constructing hypothesis tests will therefore be to transform the regression or the estimates so as to eliminate the effects of this correlation.


Correcting for Correlation by Adding Leads
and Lags of $\Delta y_2$


One correction for the correlation between $z_t^*$ and $u_{2\tau}$, suggested by Saikkonen (1991), Phillips and Loretan (1991), Stock and Watson (1993), and Wooldridge (1991), is to augment [19.3.1] with leads and lags of $\Delta y_{2t}$. Specifically, since $z_t^*$ and $u_{2t}$ are stationary, we can define $\tilde{z}_t$ to be the residual from a linear projection of $z_t^*$ on $\{u_{2,t-p}, u_{2,t-p+1}, \ldots, u_{2,t-1}, u_{2t}, u_{2,t+1}, \ldots, u_{2,t+p}\}$:

$$
z_t^* = \sum_{s=-p}^{p} \beta_s'u_{2,t-s} + \tilde{z}_t,
$$

where $\tilde{z}_t$ by construction is uncorrelated with $u_{2,t-s}$ for $s = -p, -p+1, \ldots, p$. Recalling from [19.3.2] that $u_{2t} = \Delta y_{2t}$, equation [19.3.1] then can be written

$$
y_{1t} = \alpha + \gamma'y_{2t} + \sum_{s=-p}^{p} \beta_s'\Delta y_{2,t-s} + \tilde{z}_t.
\qquad [19.3.19]
$$

If we are willing to assume that the correlation between $z_t^*$ and $u_{2,t-s}$ is zero for $|s| > p$, then an $F$ test about the true value of $\gamma$ that has an asymptotic $\chi^2$ distribution is easy to construct using the same approach adopted in [19.3.18].
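To make the regression in [19.3.19] concrete, the sketch below builds its regressor matrix from a series of $y_{2t}$ observations. The function name `leads_lags_design` and the column ordering (constant, $y_{2t}$, then $\Delta y_{2,t-s}$ for $s = -p, \ldots, p$) are assumptions made for illustration. Observations at either end of the sample are dropped, because $p$ leads and $p$ lags of $\Delta y_{2t}$ must all be available.

```python
import numpy as np

def leads_lags_design(y2, p):
    """Regressors for [19.3.19]: constant, y_2t, and Delta y_{2,t-s}, s = -p..p.

    y2 : (T0,) or (T0 x g) array of levels, indexed t = 0, ..., T0 - 1
    p  : number of leads and of lags of Delta y_2 to include
    Returns the design matrix and the array of usable dates t.
    """
    y2 = np.asarray(y2, dtype=float)
    if y2.ndim == 1:
        y2 = y2[:, None]
    T0 = y2.shape[0]
    dy = np.diff(y2, axis=0)            # dy[i] is Delta y_2 at date i + 1
    # Delta y_{2,t-s} must exist for every s in -p..p,
    # so t runs from p + 1 through T0 - 1 - p.
    t_idx = np.arange(p + 1, T0 - p)
    cols = [np.ones(len(t_idx)), y2[t_idx]]
    for s in range(-p, p + 1):
        cols.append(dy[t_idx - s - 1])  # Delta y_2 at date t - s
    return np.column_stack(cols), t_idx
```

The dependent variable for the regression would then be `y1[t_idx]`, and the coefficient on the second block of columns is the estimate of $\gamma$.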

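For a more formal statement, let $y_{1t}$ and $y_{2t}$ satisfy [19.3.19] and [19.3.2] with

$$
\begin{bmatrix} \tilde{z}_t \\ u_{2t} \end{bmatrix} = \sum_{s=0}^{\infty} \tilde{\Psi}_s\varepsilon_{t-s},
$$

where $\{s\cdot\tilde{\Psi}_s\}_{s=0}^{\infty}$ is an absolutely summable sequence of $(n \times n)$ matrices and $\{\varepsilon_t\}_{t=-\infty}^{\infty}$ is an i.i.d. sequence of $(n \times 1)$ vectors with mean zero, variance $PP'$, and finite fourth moments and with $\tilde{\Psi}(1)\cdot P$ nonsingular. Suppose that $\tilde{z}_t$ is uncorrelated with $u_{2\tau}$ for all $t$ and $\tau$, so that

$$
P = \begin{bmatrix} \sigma_1 & 0' \\ 0 & P_{22} \end{bmatrix}
\qquad [19.3.20]
$$

$$
\tilde{\Psi}(L) = \begin{bmatrix} \tilde{\psi}_{11}(L) & 0' \\ 0 & \tilde{\Psi}_{22}(L) \end{bmatrix},
\qquad [19.3.21]
$$

where $P_{22}$ and $\tilde{\Psi}_{22}(L)$ are $(g \times g)$ matrices for $g \equiv n - 1$. Define

$$
w_t \equiv (u_{2,t-p}', u_{2,t-p+1}', \ldots, u_{2,t-1}', u_{2t}', u_{2,t+1}', \ldots, u_{2,t+p}')'
$$

$$
\beta \equiv (\beta_p', \beta_{p-1}', \ldots, \beta_{-p}')',
$$

so that the regression model [19.3.19] can be written

$$
y_{1t} = \beta'w_t + \alpha + \gamma'y_{2t} + \tilde{z}_t.
\qquad [19.3.22]
$$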


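The reader is invited to confirm in Exercise 19.2 that the OLS estimates of [19.3.22] satisfy

$$
\begin{bmatrix} T^{1/2}(\hat{\beta}_T - \beta) \\ T^{1/2}(\hat{\alpha}_T - \alpha) \\ T(\hat{\gamma}_T - \gamma) \end{bmatrix}
\xrightarrow{L}
\begin{bmatrix} Q^{-1}h_1 \\ \tilde{\lambda}_{11}\nu_1 \\ \tilde{\lambda}_{11}\nu_2 \end{bmatrix},
\qquad [19.3.23]
$$

where $Q \equiv E(w_tw_t')$, $T^{-1/2}\sum w_t\tilde{z}_t \xrightarrow{L} h_1$, $\tilde{\lambda}_{11} \equiv \sigma_1\cdot\tilde{\psi}_{11}(1)$, and

$$
\begin{bmatrix} \nu_1 \\ \nu_2 \end{bmatrix}
\equiv
\begin{bmatrix} 1 & \left\{\int [W_2(r)]'\,dr\right\}\tilde{\Lambda}_{22}' \\
\tilde{\Lambda}_{22}\int W_2(r)\,dr & \tilde{\Lambda}_{22}\left\{\int [W_2(r)]\cdot[W_2(r)]'\,dr\right\}\tilde{\Lambda}_{22}' \end{bmatrix}^{-1}
\begin{bmatrix} W_1(1) \\ \tilde{\Lambda}_{22}\left\{\int [W_2(r)]\,dW_1(r)\right\} \end{bmatrix}.
$$

Here $\tilde{\Lambda}_{22} \equiv \tilde{\Psi}_{22}(1)\cdot P_{22}$, $W_1(r)$ is univariate standard Brownian motion, $W_2(r)$ is $g$-dimensional standard Brownian motion that is independent of $W_1(\cdot)$, and the integral sign denotes integration over $r$ from 0 to 1. Hence, as in [19.3.12],

$$
\left.\begin{bmatrix} \nu_1 \\ \nu_2 \end{bmatrix}\right| W_2(\cdot)
\sim N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix},\;
\begin{bmatrix} 1 & \left\{\int [W_2(r)]'\,dr\right\}\tilde{\Lambda}_{22}' \\
\tilde{\Lambda}_{22}\int W_2(r)\,dr & \tilde{\Lambda}_{22}\left\{\int [W_2(r)]\cdot[W_2(r)]'\,dr\right\}\tilde{\Lambda}_{22}' \end{bmatrix}^{-1} \right).
\qquad [19.3.24]
$$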

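Moreover, the Wald form of the OLS $\chi^2$ test of the null hypothesis $R_\gamma\gamma = r$, where $R_\gamma$ is an $(m \times g)$ matrix and $r$ is an $(m \times 1)$ vector, can be shown to satisfy

$$
\begin{aligned}
\chi_T^2
&= \{R_\gamma\hat{\gamma}_T - r\}'
\left\{ s_T^2\,[0 \;\; 0 \;\; R_\gamma]
\begin{bmatrix} \sum w_tw_t' & \sum w_t & \sum w_ty_{2t}' \\
\sum w_t' & T & \sum y_{2t}' \\
\sum y_{2t}w_t' & \sum y_{2t} & \sum y_{2t}y_{2t}' \end{bmatrix}^{-1}
\begin{bmatrix} 0' \\ 0' \\ R_\gamma' \end{bmatrix} \right\}^{-1}
\{R_\gamma\hat{\gamma}_T - r\} \\
&\xrightarrow{L} (\tilde{\lambda}_{11}^2/s_T^2)\,[R_\gamma\nu_2]'
\left\{ [0 \;\; R_\gamma]
\begin{bmatrix} 1 & \left\{\int [W_2(r)]'\,dr\right\}\tilde{\Lambda}_{22}' \\
\tilde{\Lambda}_{22}\int W_2(r)\,dr & \tilde{\Lambda}_{22}\left\{\int [W_2(r)]\cdot[W_2(r)]'\,dr\right\}\tilde{\Lambda}_{22}' \end{bmatrix}^{-1}
\begin{bmatrix} 0' \\ R_\gamma' \end{bmatrix} \right\}^{-1}
[R_\gamma\nu_2];
\end{aligned}
\qquad [19.3.25]
$$

see Exercise 19.3. But result [19.3.24] implies that conditional on $W_2(\cdot)$, the expression in [19.3.25] is $(\tilde{\lambda}_{11}^2/s_T^2)$ times a $\chi^2(m)$ variable. Since this distribution is the same for all $W_2(\cdot)$, it follows that the unconditional distribution also satisfies

$$
\chi_T^2 \xrightarrow{L} (\tilde{\lambda}_{11}^2/s_T^2)\cdot\chi^2(m).
\qquad [19.3.26]
$$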


Result [19.3.26] establishes that in order to test a hypothesis about the value of the cointegrating vector $\gamma$, we can estimate [19.3.19] by OLS and calculate a standard $F$ test of the hypothesis that $R_\gamma\gamma = r$ using the usual formula. We need only to multiply the OLS $F$ statistic by a consistent estimate of $(s_T^2/\tilde{\lambda}_{11}^2)$, and the $F$ statistic can be compared with the usual $F(m, T - k)$ tables for $k$ the number of parameters estimated in [19.3.19] for an asymptotically valid test. Similarly, the OLS $t$ statistic could be multiplied by $(s_T^2/\tilde{\lambda}_{11}^2)^{1/2}$ and compared with the standard $t$ tables.

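A consistent estimate of $\tilde{\lambda}_{11}^2$ is easy to obtain. Recall that $\tilde{\lambda}_{11} = \sigma_1\cdot\tilde{\psi}_{11}(1)$, where $\tilde{z}_t = \tilde{\psi}_{11}(L)\varepsilon_{1t}$ and $E(\varepsilon_{1t}^2) = \sigma_1^2$. Suppose we approximate $\tilde{\psi}_{11}(L)$ by an AR($p$) process, and let $\hat{u}_t$ denote the sample residual resulting from OLS estimation of [19.3.19]. If $\hat{u}_t$ is regressed on $p$ of its own lags:

$$
\hat{u}_t = \phi_1\hat{u}_{t-1} + \phi_2\hat{u}_{t-2} + \cdots + \phi_p\hat{u}_{t-p} + e_t,
$$

then a natural estimate of $\tilde{\lambda}_{11}$ is

$$
\hat{\lambda}_{11} = \hat{\sigma}_1/(1 - \hat{\phi}_1 - \hat{\phi}_2 - \cdots - \hat{\phi}_p),
\qquad [19.3.27]
$$

where

$$
\hat{\sigma}_1^2 = (T - p)^{-1}\sum_{t=p+1}^{T} \hat{e}_t^2
$$

and where $T$ indicates the number of observations actually used to estimate [19.3.19]. Alternatively, if the dynamics implied by $\tilde{\psi}_{11}(L)$ were to be approximated on the basis of $q$ autocovariances, the Newey-West estimator could be used:

$$
\hat{\lambda}_{11}^2 = \hat{c}_0 + 2\cdot\sum_{j=1}^{q} [1 - j/(q + 1)]\,\hat{c}_j,
\qquad [19.3.28]
$$

where

$$
\hat{c}_j = T^{-1}\sum_{t=j+1}^{T} \hat{u}_t\hat{u}_{t-j}.
$$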
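The autoregressive estimate in [19.3.27] might be computed along the following lines. This is only a sketch: the function name `ar_long_run_sd` is hypothetical, `u` is assumed to contain the residuals from OLS estimation of [19.3.19], and the autoregression is fitted without a constant term, as in the displayed regression.

```python
import numpy as np

def ar_long_run_sd(u, p):
    """Estimate lambda_11 as in [19.3.27] from an AR(p) fitted to residuals.

    u : (T,) array of residuals from the leads-and-lags regression
    p : order of the approximating autoregression (p >= 1)
    Returns (lambda_hat, phi_hat, sigma2_hat).
    """
    u = np.asarray(u, dtype=float)
    T = len(u)
    Y = u[p:]                                            # u_t for t = p+1, ..., T
    X = np.column_stack([u[p - j:T - j] for j in range(1, p + 1)])  # lags 1..p
    phi = np.linalg.lstsq(X, Y, rcond=None)[0]
    e = Y - X @ phi
    sigma2 = e @ e / (T - p)                             # (T - p)^{-1} sum of e_t^2
    lam = np.sqrt(sigma2) / (1.0 - phi.sum())
    return lam, phi, sigma2
```

An OLS $t$ statistic from [19.3.19] would then be multiplied by `s_T / lam`, with `s_T` the standard error of that regression, before being compared with the standard tables.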


These results were derived under the assumption that there were no drift terms in any of the elements of $y_{2t}$. However, it is not hard to show that the same procedure works in exactly the same way when some or all of the elements of $y_{2t}$ involve deterministic time trends. In addition, there is no problem with adding a time trend to the regression of [19.3.19] and testing a hypothesis about its value using this same factor applied to the usual $F$ test. This allows testing separately the hypotheses that (1) $y_{1t} - \gamma'y_{2t}$ has no time trend and (2) $y_{1t} - \gamma'y_{2t}$ is I(0), that is, testing separately the restrictions [19.1.15] and [19.1.12]. The reader is invited to verify these claims in Exercises 19.4 and 19.5.


Illustration— Testing Hypotheses About the Cointegrating 
Relation Between Consumption and Income 


As an illustration of this approach, consider again the relation between consumption $c_t$ and income $y_t$, for which evidence of cointegration was found earlier.




The following regression was estimated for t = 1948:II to 1988:III by OLS, with 
the usual OLS formulas for standard deviations given in parentheses: 


c, = —4.52 + 0.99216 y, + 0. 15 Ayrng + ees Ayi43 + u26 AY, 43 
(2.34) (0.00306) (0.12 
+ 0.49 Ay,,, — 0.24 Ay, — 0.01 Ay,_, + OO Ye 7 
(0.12) (0.12) (0.11) [19.3.29] 
+ 0.04 Ay,_3 + 0.02 Ay,_, + a, ; 
(0.11) (0.11) 
$$
s_T^2 = (T - 11)^{-1}\sum_{t=1}^{T} \hat{u}_t^2 = (1.516)^2.
$$


Here $T$, the number of observations actually used to estimate [19.3.29], is 162. To test the null hypothesis that the cointegrating vector is $a = (1, -1)'$, we start with the usual OLS $t$ test of this hypothesis,

$$
t = (0.99216 - 1)/0.00306 = -2.562.
$$

A second-order autoregression fitted to the residuals of [19.3.29] by OLS produced

$$
\hat{u}_t = 0.7180\,\hat{u}_{t-1} + 0.2057\,\hat{u}_{t-2} + \hat{e}_t,
\qquad [19.3.30]
$$

where

$$
\hat{\sigma}_1^2 = (T - 2)^{-1}\sum_{t=3}^{T} \hat{e}_t^2 = 0.38092.
$$

Thus, the estimate of $\tilde{\lambda}_{11}$ suggested in [19.3.27] is

$$
\hat{\lambda}_{11} = (0.38092)^{1/2}/(1 - 0.7180 - 0.2057) = 8.089.
$$

Hence, a test of the null hypothesis that $a = (1, -1)'$ can be based on

$$
t\cdot(s_T/\hat{\lambda}_{11}) = (-2.562)(1.516)/(8.089) = -0.48.
$$

Since $-0.48$ is above the 5% critical value of $-1.96$ for a $N(0, 1)$ variable, we accept the null hypothesis that $a = (1, -1)'$.
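The arithmetic of this test can be retraced with a short script. The numerical inputs below are the figures reported in the text for [19.3.29] and [19.3.30]; the variable names are merely illustrative.

```python
import math

# Figures reported in the text for regression [19.3.29]
gamma_hat = 0.99216      # estimated coefficient on y_t
se_gamma = 0.00306       # usual OLS standard error
s_T = 1.516              # standard error of the regression

# Figures reported for the AR(2) fitted to the residuals, [19.3.30]
phi1, phi2 = 0.7180, 0.2057
sigma2_e = 0.38092

t_stat = (gamma_hat - 1.0) / se_gamma                      # usual OLS t statistic
lam_hat = math.sqrt(sigma2_e) / (1.0 - phi1 - phi2)        # estimate from [19.3.27]
t_adj = t_stat * s_T / lam_hat                             # rescaled t statistic

print(round(t_stat, 3), round(lam_hat, 3), round(t_adj, 2))
```

Because the two autoregressive coefficients sum to nearly 0.92, the denominator in [19.3.27] is small and the rescaling shrinks the $t$ statistic by a factor of more than five.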

To test the restrictions implied by cointegration for the time trend and sto- 
chastic component separately, the regression of [19.3.29] was reestimated with a 
time trend included: 


c, = 198.9 + 0.6812 y, + 0.2690 t + 0.03 Ay,,4 + 0.17 AYr4s 


(15.0) (0.0229) (0.0197) (0.08) 
+ 0.15 A ee ds ae ree 

eos >? se ey ae [19.3.31] 
+ 0.23 Ay,-, + 020 Ay. + 0.19 Ay, eh 

(0.08) 


$$
s_T^2 = (T - 12)^{-1}\sum_{t=1}^{T} \hat{u}_t^2 = (1.017)^2.
$$


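A second-order autoregression fitted to the residuals of [19.3.31] produced

$$
\hat{u}_t = 0.6872\,\hat{u}_{t-1} + 0.1292\,\hat{u}_{t-2} + \hat{e}_t,
$$

where

$$
\hat{\sigma}_1^2 = (T - 2)^{-1}\sum_{t=3}^{T} \hat{e}_t^2 = 0.34395
$$

and

$$
\hat{\lambda}_{11} = (0.34395)^{1/2}/(1 - 0.6872 - 0.1292) = 3.194.
$$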




A test of the hypothesis that the time trend does not contribute to [19.3.31] is thus given by

$$
[(0.2690)/(0.0197)]\cdot[(1.017)/(3.194)] = 4.35.
$$

Since $4.35 > 1.96$, we reject the null hypothesis that the coefficient on the time trend is zero.

The OLS results in [19.3.29] are certainly consistent with the hypothesis that consumption and income are cointegrated with cointegrating vector $a = (1, -1)'$. However, [19.3.31] indicates that this result is dominated by the deterministic time trend common to $c_t$ and $y_t$. It appears that while $a = (1, -1)'$ is sufficient to eliminate the trend components of $c_t$ and $y_t$, the residual $c_t - y_t$ contains a stochastic component that could be viewed as I(1). Figure 19.6 provides a plot of $c_t - y_t$. It is indeed the case that this transformation seems to have eliminated the trend, though stochastic shocks to $c_t - y_t$ do not appear to die out within a period as short as 2 years.


Further Remarks and Extensions 


It was assumed throughout the derivations in this section that $\tilde{z}_t$ is I(0), so that $y_t$ is cointegrated with the cointegrating vector having a nonzero coefficient on $y_{1t}$. If $y_t$ were not cointegrated, then [19.3.19] would be a spurious regression and the tests that were described would not be valid. For this reason estimation of [19.3.19] would usually be undertaken after an initial investigation suggested the presence of a cointegrating relation.



FIGURE 19.6 One hundred times the difference between the log of personal consumption expenditures ($c_t$) and the log of personal disposable income ($y_t$) for the United States, quarterly, 1947-89.




It was also assumed that $\tilde{\Lambda}_{22}$ is nonsingular, meaning that there are no cointegrating relations among the variables in $y_{2t}$. Suppose instead that we are interested in estimating $h > 1$ different cointegrating vectors, as represented by a system of the form

$$
\underset{(h\times 1)}{y_{1t}} = \underset{(h\times g)}{\Gamma'}\cdot\underset{(g\times 1)}{y_{2t}} + \underset{(h\times 1)}{\mu_1^*} + \underset{(h\times 1)}{z_t^*}
\qquad [19.3.32]
$$

$$
\underset{(g\times 1)}{\Delta y_{2t}} = \underset{(g\times 1)}{\delta_2} + \underset{(g\times 1)}{u_{2t}}
\qquad [19.3.33]
$$

with

$$
\begin{bmatrix} z_t^* \\ u_{2t} \end{bmatrix} = \Psi^*(L)\varepsilon_t
$$

and $\Psi^*(1)$ nonsingular. Here the generalization of the previous approach would be to augment [19.3.32] with leads and lags of $\Delta y_{2t}$:

$$
y_{1t} = \mu_1^* + \Gamma'y_{2t} + \sum_{s=-p}^{p} B_s'\Delta y_{2,t-s} + \tilde{z}_t,
\qquad [19.3.34]
$$

where $B_s'$ denotes an $(h \times g)$ matrix of coefficients and it is assumed that $\tilde{z}_t$ is uncorrelated with $u_{2\tau}$ for all $t$ and $\tau$. Expression [19.3.34] describes a set of $h$ equations. The $i$th equation regresses $y_{it}$ on a constant, on the current value of all the elements of $y_{2t}$, and on past, present, and future changes of all the elements of $y_{2t}$. This equation could be estimated by OLS, with the usual $F$ statistics multiplied by $[s_T^{(i)}/\hat{\lambda}_{11}^{(i)}]^2$, where $s_T^{(i)}$ is the standard error of the regression and $\hat{\lambda}_{11}^{(i)}$ could be estimated from the autocovariances of the residuals $\hat{z}_{it}$ for the regression.

The approach just described estimated the relation in [19.3.19] by OLS and made adjustments to the usual $t$ or $F$ statistics so that they could be compared with the standard $t$ and $F$ tables. Stock and Watson (1993) also suggested the more efficient approach of first estimating [19.3.19] by OLS, then using the residuals to construct a consistent estimate of the autocorrelation of $u_t$ as in [19.3.27] or [19.3.28], and finally reestimating the equation by generalized least squares. The resulting GLS standard errors could be used to construct asymptotically $\chi^2$ hypothesis tests.

Phillips and Loretan (1991, p. 424) suggested that instead autocorrelation of the residuals of [19.3.19] could be handled by including lagged values of the residual of the cointegrating relation in the form of

$$
y_{1t} = \alpha + \gamma'y_{2t} + \sum_{s=-p}^{p} \beta_s'\Delta y_{2,t-s} + \sum_{s=1}^{p} \phi_s(y_{1,t-s} - \gamma'y_{2,t-s}) + \varepsilon_{1t}.
\qquad [19.3.35]
$$

Their proposal was to estimate the parameters in [19.3.35] by numerical minimization of the sum of squared residuals.


Phillips and Hansen’s Fully Modified OLS Estimates 


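A related approach was suggested by Phillips and Hansen (1990). Consider again a system with a single cointegrating relation written in the form

$$
y_{1t} = \alpha + \gamma'y_{2t} + z_t^*
\qquad [19.3.36]
$$

$$
\Delta y_{2t} = u_{2t}
\qquad [19.3.37]
$$

$$
\begin{bmatrix} z_t^* \\ u_{2t} \end{bmatrix} = \Psi^*(L)\varepsilon_t
$$

$$
E(\varepsilon_t\varepsilon_t') = PP',
$$

where $y_{2t}$ is a $(g \times 1)$ vector and $\varepsilon_t$ is an $(n \times 1)$ i.i.d. zero-mean vector for $n \equiv (g + 1)$. Define

$$
\Lambda^* \equiv \Psi^*(1)\cdot P
$$

$$
\underset{(n\times n)}{\Sigma^*} \equiv \Lambda^*\cdot[\Lambda^*]'
\equiv \begin{bmatrix} \underset{(1\times 1)}{\Sigma_{11}^*} & \underset{(1\times g)}{\Sigma_{21}^{*\prime}} \\
\underset{(g\times 1)}{\Sigma_{21}^*} & \underset{(g\times g)}{\Sigma_{22}^*} \end{bmatrix},
\qquad [19.3.38]
$$

with $\Lambda^*$ as always assumed to be a nonsingular matrix.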
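Recall from equation [10.3.4] that the autocovariance-generating function for $(z_t^*, u_{2t}')'$ is given by

$$
\begin{aligned}
G(z) &= \sum_{v=-\infty}^{\infty} z^v
\begin{bmatrix} E(z_t^*z_{t-v}^*) & E(z_t^*u_{2,t-v}') \\
E(u_{2t}z_{t-v}^*) & E(u_{2t}u_{2,t-v}') \end{bmatrix} \\
&= [\Psi^*(z)]\cdot PP'\cdot[\Psi^*(z^{-1})]'.
\end{aligned}
$$

Thus, $\Sigma^*$ could alternatively be described as the autocovariance-generating function $G(z)$ evaluated at $z = 1$:

$$
\begin{bmatrix} \Sigma_{11}^* & \Sigma_{21}^{*\prime} \\ \Sigma_{21}^* & \Sigma_{22}^* \end{bmatrix}
= \sum_{v=-\infty}^{\infty}
\begin{bmatrix} E(z_t^*z_{t-v}^*) & E(z_t^*u_{2,t-v}') \\
E(u_{2t}z_{t-v}^*) & E(u_{2t}u_{2,t-v}') \end{bmatrix}.
\qquad [19.3.39]
$$

The difference between the general distribution for the estimated cointegrating vector described in Proposition 19.2 and the convenient special case investigated in [19.3.15] is due to two factors. The first is the possibility of a nonzero value for $\Sigma_{21}^*$, and the second is the constant term that might appear in the variable $h_2$ described in Proposition 19.2 arising from a nonzero value for

$$
\aleph^* \equiv \sum_{v=0}^{\infty} E(u_{2t}z_{t+v}^*).
\qquad [19.3.40]
$$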
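The first issue can be addressed by subtracting $\Sigma_{21}^{*\prime}(\Sigma_{22}^*)^{-1}\Delta y_{2t}$ from both sides of [19.3.36], arriving at

$$
y_{1t}^\dagger = \alpha + \gamma'y_{2t} + z_t^\dagger,
$$

where

$$
y_{1t}^\dagger \equiv y_{1t} - \Sigma_{21}^{*\prime}(\Sigma_{22}^*)^{-1}\Delta y_{2t}
\qquad [19.3.41]
$$

$$
z_t^\dagger \equiv z_t^* - \Sigma_{21}^{*\prime}(\Sigma_{22}^*)^{-1}\Delta y_{2t}.
$$

Notice that since $\Delta y_{2t} = u_{2t}$, the vector $(z_t^\dagger, u_{2t}')'$ can be written as

$$
\begin{bmatrix} z_t^\dagger \\ u_{2t} \end{bmatrix} = L'\begin{bmatrix} z_t^* \\ u_{2t} \end{bmatrix}
\qquad [19.3.42]
$$

for

$$
L' \equiv \begin{bmatrix} 1 & -\Sigma_{21}^{*\prime}(\Sigma_{22}^*)^{-1} \\ 0 & I_g \end{bmatrix}
\equiv \begin{bmatrix} \underset{(1\times n)}{\ell_1'} \\ \underset{(g\times n)}{L_2'} \end{bmatrix}.
\qquad [19.3.43]
$$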


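Suppose we were to estimate $\alpha$ and $\gamma$ with an OLS regression of $y_{1t}^\dagger$ on a constant and $y_{2t}$:

$$
\begin{bmatrix} \hat{\alpha}_T^\dagger \\ \hat{\gamma}_T^\dagger \end{bmatrix}
= \begin{bmatrix} T & \sum y_{2t}' \\ \sum y_{2t} & \sum y_{2t}y_{2t}' \end{bmatrix}^{-1}
\begin{bmatrix} \sum y_{1t}^\dagger \\ \sum y_{2t}y_{1t}^\dagger \end{bmatrix}.
\qquad [19.3.44]
$$

The distribution of the resulting estimates is readily found from Proposition 19.2. Note that the vector $\lambda_1^{*\prime}$ used in Proposition 19.2 can be written as $e_1'\Lambda^*$ for $e_1'$ the first row of $I_n$, while the matrix $\Lambda_2^*$ in Proposition 19.2 can be written as $L_2'\Lambda^*$ for $L_2'$ the last $g$ rows of $L'$. The asymptotic distribution of the estimates in [19.3.44] is found by writing $\Lambda_2^*$ in [19.2.13] as $L_2'\Lambda^*$, replacing $\lambda_1^{*\prime} = e_1'\Lambda^*$ in [19.2.13] with $\ell_1'\Lambda^*$, and replacing $E(u_{2t}z_{t+v}^*)$ with $E(u_{2t}z_{t+v}^\dagger)$:

$$
\begin{aligned}
\begin{bmatrix} T^{1/2}(\hat{\alpha}_T^\dagger - \alpha) \\ T(\hat{\gamma}_T^\dagger - \gamma) \end{bmatrix}
&= \begin{bmatrix} 1 & T^{-3/2}\sum y_{2t}' \\ T^{-3/2}\sum y_{2t} & T^{-2}\sum y_{2t}y_{2t}' \end{bmatrix}^{-1}
\begin{bmatrix} T^{-1/2}\sum z_t^\dagger \\ T^{-1}\sum y_{2t}z_t^\dagger \end{bmatrix} \\
&\xrightarrow{L}
\begin{bmatrix} 1 & \left\{\int [W(r)]'\,dr\right\}\Lambda^{*\prime}L_2 \\
L_2'\Lambda^*\int W(r)\,dr & L_2'\Lambda^*\left\{\int [W(r)]\cdot[W(r)]'\,dr\right\}\Lambda^{*\prime}L_2 \end{bmatrix}^{-1} \\
&\qquad \times
\begin{bmatrix} \ell_1'\Lambda^*W(1) \\
L_2'\Lambda^*\left\{\int [W(r)]\,[dW(r)]'\right\}\Lambda^{*\prime}\ell_1 + \aleph^\dagger \end{bmatrix},
\end{aligned}
\qquad [19.3.45]
$$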


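where $W(r)$ denotes $n$-dimensional standard Brownian motion and

$$
\begin{aligned}
\aleph^\dagger &\equiv \sum_{v=0}^{\infty} E(u_{2t}z_{t+v}^\dagger) \\
&= \sum_{v=0}^{\infty} E\{u_{2t}[z_{t+v}^* - \Sigma_{21}^{*\prime}(\Sigma_{22}^*)^{-1}u_{2,t+v}]\} \\
&= \sum_{v=0}^{\infty} E\{u_{2t}[z_{t+v}^* \;\; u_{2,t+v}']\}
\begin{bmatrix} 1 \\ -(\Sigma_{22}^*)^{-1}\Sigma_{21}^* \end{bmatrix}.
\end{aligned}
\qquad [19.3.46]
$$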
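Now, consider the $(n \times 1)$ vector process defined by

$$
B(r) \equiv \begin{bmatrix} \ell_1' \\ L_2' \end{bmatrix}\Lambda^*W(r).
\qquad [19.3.47]
$$

From [19.3.43] and [19.3.38], this is Brownian motion with variance matrix

$$
\begin{aligned}
E\{[B(1)]\cdot[B(1)]'\}
&= \begin{bmatrix} \ell_1' \\ L_2' \end{bmatrix}\Lambda^*\Lambda^{*\prime}[\ell_1 \;\; L_2] \\
&= \begin{bmatrix} 1 & -\Sigma_{21}^{*\prime}(\Sigma_{22}^*)^{-1} \\ 0 & I_g \end{bmatrix}
\begin{bmatrix} \Sigma_{11}^* & \Sigma_{21}^{*\prime} \\ \Sigma_{21}^* & \Sigma_{22}^* \end{bmatrix}
\begin{bmatrix} 1 & 0' \\ -(\Sigma_{22}^*)^{-1}\Sigma_{21}^* & I_g \end{bmatrix} \\
&= \begin{bmatrix} (\sigma_1^\dagger)^2 & 0' \\ 0 & \Sigma_{22}^* \end{bmatrix},
\end{aligned}
\qquad [19.3.48]
$$

where

$$
(\sigma_1^\dagger)^2 \equiv \Sigma_{11}^* - \Sigma_{21}^{*\prime}(\Sigma_{22}^*)^{-1}\Sigma_{21}^*.
\qquad [19.3.49]
$$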
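The block-diagonalization in [19.3.48] is easy to check numerically. The sketch below draws an arbitrary positive definite matrix to stand in for $\Sigma^*$ (the random draw, the seed, and the choice $g = 2$ are purely illustrative assumptions), forms $L'$ as in [19.3.43], and confirms that $L'\Sigma^*L$ has zero off-diagonal blocks with the scalar in [19.3.49] in its upper left corner.

```python
import numpy as np

rng = np.random.default_rng(0)
g = 2
A = rng.standard_normal((g + 1, g + 1))
Sigma = A @ A.T                          # arbitrary positive definite stand-in for Sigma*

S11 = Sigma[0, 0]
S21 = Sigma[1:, 0]                       # (g,) vector Sigma*_21
S22 = Sigma[1:, 1:]                      # (g x g) block Sigma*_22

# L' from [19.3.43]: first row is [1, -Sigma*_21' (Sigma*_22)^{-1}]
L_prime = np.eye(g + 1)
L_prime[0, 1:] = -np.linalg.solve(S22, S21)

V = L_prime @ Sigma @ L_prime.T          # variance matrix in [19.3.48]
sigma_dagger_sq = S11 - S21 @ np.linalg.solve(S22, S21)   # [19.3.49]

print(np.round(V, 10))
print(sigma_dagger_sq)
```

The lower right block of `V` is $\Sigma_{22}^*$ itself, since the transformation leaves $u_{2t}$ unchanged.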


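Partition $B(r)$ as

$$
\underset{(n\times 1)}{B(r)} = \begin{bmatrix} \underset{(1\times 1)}{B_1(r)} \\ \underset{(g\times 1)}{B_2(r)} \end{bmatrix}
= \begin{bmatrix} \ell_1'\Lambda^*W(r) \\ L_2'\Lambda^*W(r) \end{bmatrix}.
$$

Then [19.3.48] implies that $B_1(r)$ is scalar Brownian motion with variance $(\sigma_1^\dagger)^2$ while $B_2(r)$ is $g$-dimensional Brownian motion with variance matrix $\Sigma_{22}^*$, with $B_1(\cdot)$ independent of $B_2(\cdot)$. The process $B(r)$ in turn can be viewed as generated by a different standard Brownian motion $W^\dagger(r)$, where

$$
\begin{bmatrix} B_1(r) \\ B_2(r) \end{bmatrix}
= \begin{bmatrix} \sigma_1^\dagger & 0' \\ 0 & P_{22}^* \end{bmatrix}
\begin{bmatrix} W_1^\dagger(r) \\ W_2^\dagger(r) \end{bmatrix}
$$

for $P_{22}^*P_{22}^{*\prime} = \Sigma_{22}^*$ the Cholesky factorization of $\Sigma_{22}^*$. The result [19.3.45] can then equivalently be expressed as

$$
\begin{aligned}
\begin{bmatrix} T^{1/2}(\hat{\alpha}_T^\dagger - \alpha) \\ T(\hat{\gamma}_T^\dagger - \gamma) \end{bmatrix}
&\xrightarrow{L}
\begin{bmatrix} 1 & \left\{\int [W_2^\dagger(r)]'\,dr\right\}P_{22}^{*\prime} \\
P_{22}^*\int W_2^\dagger(r)\,dr & P_{22}^*\left\{\int [W_2^\dagger(r)]\cdot[W_2^\dagger(r)]'\,dr\right\}P_{22}^{*\prime} \end{bmatrix}^{-1} \\
&\qquad \times
\begin{bmatrix} \sigma_1^\dagger W_1^\dagger(1) \\
P_{22}^*\left\{\int [W_2^\dagger(r)]\,dW_1^\dagger(r)\right\}\sigma_1^\dagger + \aleph^\dagger \end{bmatrix}.
\end{aligned}
\qquad [19.3.50]
$$

If it were not for the presence of the constant $\aleph^\dagger$, the distribution in [19.3.50] would be of the form of [19.3.11], from which it would follow that conditional on $W_2^\dagger(\cdot)$, the variable in [19.3.50] would be Gaussian and test statistics that are asymptotically $\chi^2$ could be generated as before.

Recalling [19.3.39], one might propose to estimate $\Sigma^*$ by

$$
\begin{bmatrix} \hat{\Sigma}_{11}^* & \hat{\Sigma}_{21}^{*\prime} \\ \hat{\Sigma}_{21}^* & \hat{\Sigma}_{22}^* \end{bmatrix}
= \hat{\Gamma}_0 + \sum_{v=1}^{q} \{1 - [v/(q + 1)]\}(\hat{\Gamma}_v + \hat{\Gamma}_v'),
\qquad [19.3.51]
$$

where

$$
\hat{\Gamma}_v = T^{-1}\sum_{t=v+1}^{T}
\begin{bmatrix} \hat{z}_t^*\hat{z}_{t-v}^* & \hat{z}_t^*\hat{u}_{2,t-v}' \\
\hat{u}_{2t}\hat{z}_{t-v}^* & \hat{u}_{2t}\hat{u}_{2,t-v}' \end{bmatrix}
\equiv \begin{bmatrix} \hat{\Gamma}_{11}^{(v)} & \hat{\Gamma}_{12}^{(v)} \\ \hat{\Gamma}_{21}^{(v)} & \hat{\Gamma}_{22}^{(v)} \end{bmatrix}
\qquad [19.3.52]
$$

for $\hat{z}_t^*$ the sample residual resulting from estimation of [19.3.36] by OLS and $\hat{u}_{2t} = \Delta y_{2t}$. To arrive at a similar estimate of $\aleph^\dagger$, note that [19.3.46] can be written

$$
\begin{aligned}
\aleph^\dagger
&= \sum_{v=0}^{\infty} E\{u_{2,t-v}[z_t^* \;\; u_{2t}']\}
\begin{bmatrix} 1 \\ -(\Sigma_{22}^*)^{-1}\Sigma_{21}^* \end{bmatrix} \\
&= \sum_{v=0}^{\infty} \left\{ E\left( \begin{bmatrix} z_t^* \\ u_{2t} \end{bmatrix} u_{2,t-v}' \right) \right\}'
\begin{bmatrix} 1 \\ -(\Sigma_{22}^*)^{-1}\Sigma_{21}^* \end{bmatrix} \\
&= \sum_{v=0}^{\infty} \big[ \Gamma_{12}^{(v)\prime} \;\; \Gamma_{22}^{(v)\prime} \big]
\begin{bmatrix} 1 \\ -(\Sigma_{22}^*)^{-1}\Sigma_{21}^* \end{bmatrix}.
\end{aligned}
$$

This suggests the estimator

$$
\hat{\aleph}_T^\dagger = \sum_{v=0}^{q} \{1 - [v/(q + 1)]\}\,
\big[ \hat{\Gamma}_{12}^{(v)\prime} \;\; \hat{\Gamma}_{22}^{(v)\prime} \big]
\begin{bmatrix} 1 \\ -(\hat{\Sigma}_{22}^*)^{-1}\hat{\Sigma}_{21}^* \end{bmatrix}.
\qquad [19.3.53]
$$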


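The fully modified OLS estimator proposed by Phillips and Hansen (1990) is then

$$
\begin{bmatrix} \hat{\alpha}_T^{\dagger\dagger} \\ \hat{\gamma}_T^{\dagger\dagger} \end{bmatrix}
= \begin{bmatrix} T & \sum y_{2t}' \\ \sum y_{2t} & \sum y_{2t}y_{2t}' \end{bmatrix}^{-1}
\begin{bmatrix} \sum \hat{y}_{1t}^\dagger \\ \sum y_{2t}\hat{y}_{1t}^\dagger - T\hat{\aleph}_T^\dagger \end{bmatrix}
$$

for $\hat{y}_{1t}^\dagger \equiv y_{1t} - \hat{\Sigma}_{21}^{*\prime}(\hat{\Sigma}_{22}^*)^{-1}\Delta y_{2t}$. This analysis implies that

$$
\begin{bmatrix} T^{1/2}(\hat{\alpha}_T^{\dagger\dagger} - \alpha) \\ T(\hat{\gamma}_T^{\dagger\dagger} - \gamma) \end{bmatrix}
= \begin{bmatrix} 1 & T^{-3/2}\sum y_{2t}' \\ T^{-3/2}\sum y_{2t} & T^{-2}\sum y_{2t}y_{2t}' \end{bmatrix}^{-1}
\begin{bmatrix} T^{-1/2}\sum \hat{z}_t^\dagger \\ T^{-1}\sum y_{2t}\hat{z}_t^\dagger - \hat{\aleph}_T^\dagger \end{bmatrix}
\xrightarrow{L} \sigma_1^\dagger \begin{bmatrix} \nu_1 \\ \nu_2 \end{bmatrix},
$$

where

$$
\begin{bmatrix} \nu_1 \\ \nu_2 \end{bmatrix}
\equiv
\begin{bmatrix} 1 & \left\{\int [W_2^\dagger(r)]'\,dr\right\}P_{22}^{*\prime} \\
P_{22}^*\int W_2^\dagger(r)\,dr & P_{22}^*\left\{\int [W_2^\dagger(r)]\cdot[W_2^\dagger(r)]'\,dr\right\}P_{22}^{*\prime} \end{bmatrix}^{-1}
\begin{bmatrix} W_1^\dagger(1) \\ P_{22}^*\left\{\int [W_2^\dagger(r)]\,dW_1^\dagger(r)\right\} \end{bmatrix}.
$$

It follows as in [19.3.12] that

$$
\left.\begin{bmatrix} \nu_1 \\ \nu_2 \end{bmatrix}\right| W_2^\dagger(\cdot)
\sim N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix},\; H^{-1} \right)
$$

for

$$
H \equiv \begin{bmatrix} 1 & \left\{\int [W_2^\dagger(r)]'\,dr\right\}P_{22}^{*\prime} \\
P_{22}^*\int W_2^\dagger(r)\,dr & P_{22}^*\left\{\int [W_2^\dagger(r)]\cdot[W_2^\dagger(r)]'\,dr\right\}P_{22}^{*\prime} \end{bmatrix}.
$$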

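Furthermore, [19.3.49] suggests that a consistent estimate of $(\sigma_1^\dagger)^2$ is provided by

$$
(\hat{\sigma}_1^\dagger)^2 = \hat{\Sigma}_{11}^* - \hat{\Sigma}_{21}^{*\prime}(\hat{\Sigma}_{22}^*)^{-1}\hat{\Sigma}_{21}^*,
$$

with $\hat{\Sigma}_{ij}^*$ given by [19.3.51]. Thus, if we multiply the usual Wald form of the $\chi^2$ test of $m$ restrictions of the form $R\gamma = r$ by $(s_T/\hat{\sigma}_1^\dagger)^2$, the result is an asymptotically $\chi^2(m)$ statistic under the null hypothesis:

$$
\begin{aligned}
(s_T/\hat{\sigma}_1^\dagger)^2\cdot\chi_T^2
&= \{R\hat{\gamma}_T^{\dagger\dagger} - r\}'
\left\{ (\hat{\sigma}_1^\dagger)^2\,[0 \;\; R]
\begin{bmatrix} T & \sum y_{2t}' \\ \sum y_{2t} & \sum y_{2t}y_{2t}' \end{bmatrix}^{-1}
\begin{bmatrix} 0' \\ R' \end{bmatrix} \right\}^{-1}
\{R\hat{\gamma}_T^{\dagger\dagger} - r\} \\
&= \{R\cdot T(\hat{\gamma}_T^{\dagger\dagger} - \gamma)\}'
\left\{ (\hat{\sigma}_1^\dagger)^2\,[0 \;\; R]
\begin{bmatrix} 1 & T^{-3/2}\sum y_{2t}' \\ T^{-3/2}\sum y_{2t} & T^{-2}\sum y_{2t}y_{2t}' \end{bmatrix}^{-1}
\begin{bmatrix} 0' \\ R' \end{bmatrix} \right\}^{-1}
\{R\cdot T(\hat{\gamma}_T^{\dagger\dagger} - \gamma)\} \\
&\xrightarrow{L} (\sigma_1^\dagger R\nu_2)'
\left\{ (\sigma_1^\dagger)^2\,[0 \;\; R]\,H^{-1}\begin{bmatrix} 0' \\ R' \end{bmatrix} \right\}^{-1}
(\sigma_1^\dagger R\nu_2) \\
&\sim \chi^2(m).
\end{aligned}
$$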
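The steps of the fully modified procedure can be collected in a compact sketch. The code below is an illustration under stated assumptions rather than a reference implementation: the function names `bartlett_sum` and `fm_ols` are hypothetical, the system is assumed to have no drift, the first observation is dropped so that residuals line up with $\Delta y_{2t}$, and the lag truncation `q` is left to the user.

```python
import numpy as np

def bartlett_sum(Z, q, one_sided=False):
    """Bartlett-weighted sum of Gamma_v = T^{-1} sum_t z_t z_{t-v}'.

    Two-sided (default): Gamma_0 + sum_v w_v (Gamma_v + Gamma_v'), as in [19.3.51].
    One-sided: sum over v = 0..q of w_v Gamma_v, the ingredient of [19.3.53].
    """
    T = Z.shape[0]
    total = Z.T @ Z / T
    for v in range(1, q + 1):
        w = 1.0 - v / (q + 1.0)
        G = Z[v:].T @ Z[:T - v] / T
        total = total + (w * G if one_sided else w * (G + G.T))
    return total

def fm_ols(y1, y2, q):
    """Fully modified OLS for y1_t = alpha + gamma' y2_t + z_t (no drift assumed)."""
    y1 = np.asarray(y1, dtype=float)
    y2 = np.asarray(y2, dtype=float).reshape(len(y1), -1)
    X = np.column_stack([np.ones(len(y1)), y2])
    b_ols = np.linalg.lstsq(X, y1, rcond=None)[0]
    z = (y1 - X @ b_ols)[1:]                 # OLS residuals, aligned with differences
    u2 = np.diff(y2, axis=0)                 # Delta y_2t
    Z = np.column_stack([z, u2])
    Sigma = bartlett_sum(Z, q)               # estimate of Sigma*
    one_sided = bartlett_sum(Z, q, one_sided=True)
    S21, S22 = Sigma[1:, 0], Sigma[1:, 1:]
    c = np.linalg.solve(S22, S21)            # (Sigma*_22)^{-1} Sigma*_21
    aleph = one_sided.T[1:, :] @ np.concatenate([[1.0], -c])   # [19.3.53]
    y1_dag = y1[1:] - u2 @ c                 # corrected dependent variable
    Xs = X[1:]
    T = Xs.shape[0]
    rhs = Xs.T @ y1_dag
    rhs[1:] -= T * aleph                     # subtract T times the bias correction
    b_fm = np.linalg.solve(Xs.T @ Xs, rhs)
    sigma_dag_sq = Sigma[0, 0] - S21 @ c     # estimate of (sigma_1^dagger)^2
    return b_fm, sigma_dag_sq
```

The second returned value is the quantity that replaces $s_T^2$ in the Wald statistic above. A larger `q` admits more autocovariances at the cost of noisier estimates; the text leaves that choice open.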




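This description has assumed that there was no drift in any elements of the system. Hansen (1992) showed that the procedure is easily modified if $E(\Delta y_{2t}) = \delta_2 \neq 0$, simply by replacing $\hat{u}_{2t} = \Delta y_{2t}$ in [19.3.52] with

$$
\hat{u}_{2t} = \Delta y_{2t} - \hat{\delta}_2,
$$

where

$$
\hat{\delta}_2 = T^{-1}\sum_{t=1}^{T} \Delta y_{2t}.
$$

Hansen also showed that a time trend could be added to the cointegrating relation, as in

$$
y_{1t} = \alpha + \gamma'y_{2t} + \delta t + z_t^*,
$$

for which the fully modified estimator is

$$
\begin{bmatrix} \hat{\alpha}_T^{\dagger\dagger} \\ \hat{\gamma}_T^{\dagger\dagger} \\ \hat{\delta}_T^{\dagger\dagger} \end{bmatrix}
= \begin{bmatrix} T & \sum y_{2t}' & \sum t \\
\sum y_{2t} & \sum y_{2t}y_{2t}' & \sum y_{2t}t \\
\sum t & \sum ty_{2t}' & \sum t^2 \end{bmatrix}^{-1}
\begin{bmatrix} \sum \hat{y}_{1t}^\dagger \\ \sum y_{2t}\hat{y}_{1t}^\dagger - T\hat{\aleph}_T^\dagger \\ \sum t\hat{y}_{1t}^\dagger \end{bmatrix}.
$$

Collecting these estimates in a vector $b_T^{\dagger\dagger} = (\hat{\alpha}_T^{\dagger\dagger}, [\hat{\gamma}_T^{\dagger\dagger}]', \hat{\delta}_T^{\dagger\dagger})'$, a hypothesis involving $m$ restrictions on $\beta$ of the form $R\beta = r$ can be tested by

$$
\{Rb_T^{\dagger\dagger} - r\}'
\left\{ (\hat{\sigma}_1^\dagger)^2 R
\begin{bmatrix} T & \sum y_{2t}' & \sum t \\
\sum y_{2t} & \sum y_{2t}y_{2t}' & \sum y_{2t}t \\
\sum t & \sum ty_{2t}' & \sum t^2 \end{bmatrix}^{-1}
R' \right\}^{-1}
\{Rb_T^{\dagger\dagger} - r\}
\xrightarrow{L} \chi^2(m).
$$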


Park’s Canonical Cointegrating Regressions 


A closely related idea has been suggested by Park (1992). In Park’s procedure, 
both the dependent and explanatory variables in [19.3.36] are transformed, and 
the resulting transformed regression can then be estimated by OLS and tested using 
standard procedures. Park and Ogaki (1991) explored the use of the VAR pre- 
whitening technique of Andrews and Monahan (1992) to replace the Bartlett es- 
timate in expressions such as [19.3.51]. 


APPENDIX 19.A. Proofs of Chapter 19 Propositions 


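■ Proof of Proposition 19.2. Define $\bar{y}_{1t} \equiv z_1^* + z_2^* + \cdots + z_t^*$ for $t = 1, 2, \ldots, T$ and $\bar{y}_{1,0} \equiv 0$. Then

$$
\begin{bmatrix} \bar{y}_{1t} \\ y_{2t} \end{bmatrix} = \begin{bmatrix} 0 \\ y_{2,0} \end{bmatrix} + \xi_t^*,
$$

where

$$
\xi_t^* \equiv \sum_{s=1}^{t} \begin{bmatrix} z_s^* \\ u_{2s} \end{bmatrix}.
$$

Hence, result (e) of Proposition 18.1 establishes that

$$
T^{-1}\sum_{t=1}^{T} \begin{bmatrix} \bar{y}_{1,t-1} \\ y_{2,t-1} \end{bmatrix}[z_t^* \;\; u_{2t}']
\xrightarrow{L} \Lambda^*\cdot\left\{\int [W(r)]\,[dW(r)]'\right\}\cdot\Lambda^{*\prime} + \sum_{v=1}^{\infty} \Gamma_v^{*\prime}
\qquad [19.A.1]
$$

for

$$
\Lambda^* \equiv \Psi^*(1)\cdot P
$$

$$
\Gamma_v^{*\prime} \equiv E\left\{ \begin{bmatrix} z_t^* \\ u_{2t} \end{bmatrix}[z_{t+v}^* \;\; u_{2,t+v}'] \right\}.
$$

It follows from [19.A.1] that

$$
\begin{aligned}
T^{-1}\sum_{t=1}^{T} \begin{bmatrix} \bar{y}_{1t} \\ y_{2t} \end{bmatrix}[z_t^* \;\; u_{2t}']
&= T^{-1}\sum_{t=1}^{T} \begin{bmatrix} \bar{y}_{1,t-1} \\ y_{2,t-1} \end{bmatrix}[z_t^* \;\; u_{2t}']
+ T^{-1}\sum_{t=1}^{T} \begin{bmatrix} z_t^* \\ u_{2t} \end{bmatrix}[z_t^* \;\; u_{2t}'] \\
&\xrightarrow{L} \Lambda^*\cdot\left\{\int [W(r)]\,[dW(r)]'\right\}\cdot\Lambda^{*\prime} + \sum_{v=0}^{\infty} \Gamma_v^{*\prime}.
\end{aligned}
\qquad [19.A.2]
$$

Similarly, results (a), (g), and (i) of Proposition 18.1 imply

$$
T^{-1/2}\sum_{t=1}^{T} \begin{bmatrix} z_t^* \\ u_{2t} \end{bmatrix} \xrightarrow{L} \Lambda^*\cdot W(1)
\qquad [19.A.3]
$$

$$
T^{-3/2}\sum_{t=1}^{T} \begin{bmatrix} \bar{y}_{1t} \\ y_{2t} \end{bmatrix} \xrightarrow{L} \Lambda^*\int W(r)\,dr
\qquad [19.A.4]
$$

$$
T^{-2}\sum_{t=1}^{T} \begin{bmatrix} \bar{y}_{1t} \\ y_{2t} \end{bmatrix}[\bar{y}_{1t} \;\; y_{2t}']
\xrightarrow{L} \Lambda^*\cdot\left\{\int [W(r)]\cdot[W(r)]'\,dr\right\}\cdot\Lambda^{*\prime}.
\qquad [19.A.5]
$$

Observe that the deviations of the OLS estimates in [19.2.12] from the population values $\alpha$ and $\gamma$ that describe the cointegrating relation [19.2.9] are given by

$$
\begin{bmatrix} \hat{\alpha}_T - \alpha \\ \hat{\gamma}_T - \gamma \end{bmatrix}
= \begin{bmatrix} T & \sum y_{2t}' \\ \sum y_{2t} & \sum y_{2t}y_{2t}' \end{bmatrix}^{-1}
\begin{bmatrix} \sum z_t^* \\ \sum y_{2t}z_t^* \end{bmatrix},
$$

from which

$$
\begin{aligned}
\begin{bmatrix} T^{1/2}(\hat{\alpha}_T - \alpha) \\ T(\hat{\gamma}_T - \gamma) \end{bmatrix}
&= \left\{ \begin{bmatrix} T^{-1/2} & 0' \\ 0 & T^{-1}I_g \end{bmatrix}
\begin{bmatrix} T & \sum y_{2t}' \\ \sum y_{2t} & \sum y_{2t}y_{2t}' \end{bmatrix}
\begin{bmatrix} T^{-1/2} & 0' \\ 0 & T^{-1}I_g \end{bmatrix} \right\}^{-1}
\left\{ \begin{bmatrix} T^{-1/2} & 0' \\ 0 & T^{-1}I_g \end{bmatrix}
\begin{bmatrix} \sum z_t^* \\ \sum y_{2t}z_t^* \end{bmatrix} \right\} \\
&= \begin{bmatrix} 1 & T^{-3/2}\sum y_{2t}' \\ T^{-3/2}\sum y_{2t} & T^{-2}\sum y_{2t}y_{2t}' \end{bmatrix}^{-1}
\begin{bmatrix} T^{-1/2}\sum z_t^* \\ T^{-1}\sum y_{2t}z_t^* \end{bmatrix}.
\end{aligned}
\qquad [19.A.6]
$$

But from [19.A.2],

$$
\begin{aligned}
T^{-1}\sum_{t=1}^{T} y_{2t}z_t^*
&= [0 \;\; I_g]\; T^{-1}\sum_{t=1}^{T} \begin{bmatrix} \bar{y}_{1t} \\ y_{2t} \end{bmatrix}[z_t^* \;\; u_{2t}'] \begin{bmatrix} 1 \\ 0 \end{bmatrix} \\
&\xrightarrow{L} [0 \;\; I_g]\,\Lambda^*\cdot\left\{\int [W(r)]\,[dW(r)]'\right\}\cdot\Lambda^{*\prime}\begin{bmatrix} 1 \\ 0 \end{bmatrix}
+ [0 \;\; I_g]\sum_{v=0}^{\infty} \Gamma_v^{*\prime}\begin{bmatrix} 1 \\ 0 \end{bmatrix} \\
&= \Lambda_2^*\cdot\left\{\int [W(r)]\,[dW(r)]'\right\}\cdot\lambda_1^* + \sum_{v=0}^{\infty} E(u_{2t}z_{t+v}^*).
\end{aligned}
\qquad [19.A.7]
$$

Similar use of [19.A.3] to [19.A.5] in [19.A.6] produces [19.2.13]. ■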


■ Proof of Proposition 19.3. For simplicity of exposition, the discussion is restricted to the case when $E(\Delta y_{2t}) = 0$, though it is straightforward to develop analogous results using a rescaling and rotation of variables similar to that in [18.2.43].


Appendix 19.A, Proofs of Chapter 19 Propositions 619 


Consider first what the results would be from an OLS regression of zi, on 23, = 
(Zi, Z35 ..» 5 24)’, @ constant, and y,,: 


zie = B'zh, + at + Nt, + U, [19.4.8] 


If this regression is evaluated at the true values a* = 0, X* = 0, and B = (6,, B;,..., 
B,)' the vector of population projection coefficients in [19.2.18], then the disturbance u, will 
be the residual defined in [19.2.18]. This residual had mean zero and was uncorrelated with 
z3,. The OLS estimates based on [19.A.8] would be 


6, Ezizt' Yzh Ezhyy,]~'[Izhzt 
at) =| 223! T  y; Zz* [19.A.9] 
Rt Zyoth! Lyy yas) LZyazh, 


The deviations of these estimates from the corresponding population values satisfy 


-1 


6, - B I,-1 0 0 ][Zzz27' Zzz Zzryy 
ay =|/0 1 0 223 T yy, 
TRS 0 o TI, Pe aE LY ZY 2Y be 

TI,-1 : ine 0 0 17'[Eztu, 

x] oO T 0 a 

0 : isi: 0 TI, LyzH, 

T-*22323/  T7'22x, nae “YT To 'zhu, 

=| T7323! 1 T-23y!, ru 

T-**Zy,23%'  T73?Zyy, To*2yayn | LT~*?2 yon, 


[19.A.10] 
Recalling that E(zju,) = 0, one can show that T-'Zzku, > 0 and T-'Zu,-% 0 by 


the law of large numbers. Also, T->?Zy,,u,—> 0, from the argument given in [19.A.7]. 
Furthermore, 


T"'S2tzt' TO See T-32Ezhy!, 
T- E23 1 T-323y!, 
T73?Zy,t% T7?Zy, To ?2y2¥x 


E(zzzt') 0 0 
4 0’ 1 {f [W(r)]' arkar , [19.A.11] 
0 AS | W(r) dr af [wr)] [W)]' arhar 


where W(r) is n-dimensional standard Brownian motion and Aj is a (g x m) matrix con- 
structed from the last g rows of &*(1)-P. Notice that the matrix in [19.A.11] is almost surely 
nonsingular. Substituting these results into [19.A.10] establishes that 


6, = B 0 
ae | 4)01, 
TR 0 


so that OLS estimation of [19.A.8] would produce consistent estimates of the parameters of the population linear projection [19.2.18].

An OLS regression of $y_{1t}$ on a constant and the other elements of $y_t$ is a simple transformation of the regression in [19.A.8]. To see this, notice that [19.A.8] can be written as


[tL —B']z* = a* + R*y,, + u,. [19.A.12] 




Solving {19.2.16] for z,* and substituting the result into [19.A.12] gives 
{2 —B'l(y, - wt - Ty.) = a* + Ry, + u,, 
oF, since yy, = (Yirs Yas ~~ + > Yad’, We have 
Vie = BrYa + BsYa + + ° + + BrYn + & + Nyy + u,, [19.A.13] 


where a =a* + [1 —B']pf and’ =*' + [1 —p‘]E. 

OLS estimation of [19.A.8] will produce identical fitted values to those resulting from 
OLS estimation of [19.A.13], with the relations between the estimated coefficients as just 
given. Since OLS estimation of [19.A.8] yields consistent estimates of [19.2.18], OLS es- 
timation of [19.A.13] yields consistent estimates of the corresponding transformed param- 
eters, as claimed by the proposition. @ 


@ Proof of Proposition 19.4. As in Proposition 18.2, partition AA’ as 


Zu pn 
AA! = | x0) axa) | [19.A.14] 
(nxn) 7 Z» 
x1) @xa) 
and define 
— > 45D ook! 
hie es (-Vot)- 2.22 i [19.4.15] 
0 L;, 
where 
(of? = (2, — 2yZH'Z,) {19.A.16] 
and L,, is the Cholesky factor of Z3': 
Bq! = Lal. [19.A.17] 
Recall from expression [18.A.16] that

$$
L'\Lambda\Lambda'L = I_n,
\qquad [19.A.18]
$$

implying that $\Lambda\Lambda' = (L')^{-1}(L)^{-1}$ and $(\Lambda\Lambda')^{-1} = LL'$; thus, $L$ is the Cholesky factor of $(\Lambda\Lambda')^{-1}$ referred to in Proposition 19.4.

Note further that the residuals from OLS estimation of [19.2.24] are identical to the 
residuals from OLS estimation of 


yh = a* + y*'y3, + ut [19.A.19] 
for yi, = yy. — DVu2H'y2, and yz, = L2,y2,. Recall from equation [{18.A.21] that 
T~ét/o*] 1 [hy 
. 9.A.20 
| qylot |; it een 
Finally, for the derivations that are to follow, 
T*s T- 1, 


Proof of (a). Since the sample residuals 2 for OLS estimation of [19.A.19] are identical 
to those for OLS estimation of {19.2.24], we have that 


T 
>, O20? 
T*(6, — 1) = Tt \4*@——- - 1 
ary 
> ¢ y [19.A.21] 


(T*)* & at (a? - a7.) 
(T*)-? & (a*.,)? 




at = of {(yilet) — (Wet)-¥F' ya — (az/ot)} 


19,A,22 
=ot-{{l —9#!lot]ét - (@%/ot)} a2) 
for 
* * 
+= Pee =L’y,. [19.A.23] 
Ya 


Differencing [19.A.22] results in 
(Q7 — 0.4) = of [1 -YP/of]Aégr. (19.A.24] 
Using [19.A.22] and [19.A.24], the numerator of [19.A.21] can be written 
T 
(T*)-' >» ar (ar — a7) 


= omer) >, {ts —¥F"lot]Et. - caxion}{ ery] st gel} 


= (ot)-[1 ~4poth{ (7 *)-* 2) 6h s(aee Ola 
*)2. 0 T#)- 12/458 /-7*). ~12 ; bg 
- (ot(T") (ation) {7 2, 6: ee 


[19.A.25] 


Notice that the expression 


l -aviot)- {ery 3 at ccer}| on 


. * 
-FYWet 


is a scalar and accordingly equals its own transpose: 
*1/ 28), tt ~ . * “7 1 
{t ~4r lot] {cr ) t >» &t (AE; ee 
= ara ~aioth {ery 2, £081] _stioe| 
; : 
+l ator fry > (srner Hace 


- arf ~#vlon {ry > (eae + (MEE, ee ; 


[19.A.26] 
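But from result (d) of Proposition 18.1,
$$(T^*)^{-1}\sum_{t=2}^T \left(\xi_{t-1}^*\Delta\xi_t^{*\prime} + (\Delta\xi_t^*)\xi_{t-1}^{*\prime}\right) = L'\cdot\left\{(T^*)^{-1}\sum_{t=2}^T \left(y_{t-1}\Delta y_t' + (\Delta y_t)y_{t-1}'\right)\right\}\cdot L$$
$$\overset{L}{\to} L'\cdot\{\Lambda\cdot[W(1)]\cdot[W(1)]'\cdot\Lambda' - E[(\Delta y_t)(\Delta y_t')]\}\cdot L = [W^*(1)]\cdot[W^*(1)]' - E[(\Delta\xi_t^*)(\Delta\xi_t^{*\prime})] \qquad [19.A.27]$$
for $W^*(r) \equiv L'\Lambda\cdot W(r)$ the n-dimensional standard Brownian motion discussed in equation
[18.A.17]. Substituting [19.A.27] and [19.A.20] into [19.A.26] produces
$$\begin{bmatrix} 1 & -\hat\gamma_T^{*\prime}/\sigma_1^* \end{bmatrix}\left\{(T^*)^{-1}\sum_{t=2}^T \xi_{t-1}^*(\Delta\xi_t^{*\prime})\right\}\begin{bmatrix} 1 \\ -\hat\gamma_T^*/\sigma_1^* \end{bmatrix}$$
$$\overset{L}{\to} (1/2)\cdot\left\{\begin{bmatrix} 1 & -h_2' \end{bmatrix}\left([W^*(1)]\cdot[W^*(1)]' - E[(\Delta\xi_t^*)(\Delta\xi_t^{*\prime})]\right)\begin{bmatrix} 1 \\ -h_2 \end{bmatrix}\right\}. \qquad [19.A.28]$$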


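Similar analysis of the second term in [19.A.25] using result (a) of Proposition 18.1 reveals
that
$$(T^*)^{-1/2}(\hat\alpha_T^*/\sigma_1^*)\left\{(T^*)^{-1/2}\sum_{t=2}^T (\Delta\xi_t^{*\prime})\right\}\begin{bmatrix} 1 \\ -\hat\gamma_T^*/\sigma_1^* \end{bmatrix} \overset{L}{\to} h_1\cdot[W^*(1)]'\begin{bmatrix} 1 \\ -h_2 \end{bmatrix}. \qquad [19.A.29]$$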


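Substituting [19.A.28] and [19.A.29] into [19.A.25], we conclude that
$$(T^*)^{-1}\sum_{t=2}^T \hat u_{t-1}^*(\hat u_t^* - \hat u_{t-1}^*)$$
$$\overset{L}{\to} (\sigma_1^*)^2\cdot\left\{(1/2)\begin{bmatrix} 1 & -h_2' \end{bmatrix}[W^*(1)]\cdot[W^*(1)]'\begin{bmatrix} 1 \\ -h_2 \end{bmatrix} - h_1\cdot[W^*(1)]'\begin{bmatrix} 1 \\ -h_2 \end{bmatrix}\right.$$
$$\left.\quad - (1/2)\cdot\begin{bmatrix} 1 & -h_2' \end{bmatrix}\{E[(\Delta\xi_t^*)(\Delta\xi_t^{*\prime})]\}\begin{bmatrix} 1 \\ -h_2 \end{bmatrix}\right\}. \qquad [19.A.30]$$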


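The limiting distribution for the denominator of [19.A.21] was obtained in result (b)
of Proposition 18.2:
$$(T^*)^{-2}\sum_{t=2}^T (\hat u_{t-1}^*)^2 \overset{L}{\to} (\sigma_1^*)^2\cdot H_n. \qquad [19.A.31]$$
Substituting [19.A.30] and [19.A.31] into [19.A.21] produces [19.2.36].

Proof of (b). Notice that
$$\hat c_{j,T} = (T^*)^{-1}\sum_{t=j+2}^T \hat e_t\hat e_{t-j}$$
$$= (T^*)^{-1}\sum_{t=j+2}^T (\hat u_t^* - \hat\rho_T\hat u_{t-1}^*)\cdot(\hat u_{t-j}^* - \hat\rho_T\hat u_{t-j-1}^*) \qquad [19.A.32]$$
$$= (T^*)^{-1}\sum_{t=j+2}^T \{\Delta\hat u_t^* - (\hat\rho_T - 1)\hat u_{t-1}^*\}\cdot\{\Delta\hat u_{t-j}^* - (\hat\rho_T - 1)\hat u_{t-j-1}^*\}.$$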
But [19.A.22] and [19.A.24] can be used to write 


rT 
(T*)-! ) Gr - 1)a7_ Agar, 


mp2 


I 1 
= (ot); - yy 3 {on —4F'lot]e, — (At/ot ase[_ sree 


ce 1 
- {corr - y/o asian) & 8(88"')| _gtion |} 
q 1 
- {on try@, - DHlry-mearter ry! E 482] _stiee |} 
[19.A.33] 


But result (a) implies that $(T^*)^{1/2}(\hat\rho_T - 1) \overset{p}{\to} 0$, while the other terms in [19.A.33] have
convergent distributions in the light of [19.A.20] and results (a) and (e) of Proposition 18.1.
Hence,
$$(T^*)^{-1}\sum_{t=j+2}^T (\hat\rho_T - 1)\hat u_{t-1}^*\Delta\hat u_{t-j}^* \overset{p}{\to} 0. \qquad [19.A.34]$$


Similarly, 
T 
(TY* D Gr - Dias 
T 
= (tPF) D Gr - mn — $F! lot]et, - (axiot)| 


x {a aver - caxion} 
= oP) ¥ Gr - aE -4'lot -cry-masior |] EE, 
Bea a et {19.A.35] 


x [é. (TY) [1 -er'/ot 
—(T*)-"atlot] 


= (of P-[(T*) Gr - DP - ar /et 


Of eet. yee ]} 
«for pap aoane rT 


x [L -9lot ~(1*)-"4s/oty 


“0, 
given that $(T^*)^{-2}\sum \xi_{t-1}^*\xi_{t-j-1}^{*\prime}$ and $(T^*)^{-3/2}\sum \xi_t^*$ are $O_p(1)$ by results (i) and (g) of
Proposition 18.1. Substituting [19.A.34], [19.A.35], and then [19.A.24] into [19.A.32] gives


Gr (Tt (Aad?)-(Aaz-,) 

= tf alot)" 3B (687-8)[ _ gig | 
: ™"* 119.4.36] 
3 (ot) -wy-eaey-eern| | | 
= (ot) wv -e¢an-exeare] |}, | 
It follows that for given q, 
he dar 2° 01 - Jl + Dir 
y= 
S(otF- -wg-e{ 3 0 - Wat 1] EIA) -(Ay?-p}}-E- Ea 


Thus, if q -» » with q/T — 0, 
BS ort -wyur{ § etay)-cx:oi}-e-[ |, | 
= (of)-[1 —ht]-L¥(1)PP'[W(1)]'L- | 7) [19.A.37] 

= (ot) [1 -wgt,[ 4]. 


by virtue of [19.A.18]. 




But from [19.2.29] and [19.A.31], 


Ce 
T*)-? > a2. 
uw Py : [19.A.38] 
4 1 
(of) -H, 
It then follows from [19.A.36] and [19.A.37] that 
T*)? 63. + 52)-{h2. - & 
{( ) OG, 7 { T ort ; [19.A,39] 
[1 -hy)-, - wr efar)-crLr{ | | + H,. 
Subtracting 1/2 times [19.A.39] from [19.2.36] yields [19.2.37].
Proof of (c). Notice from ae that 
= (WAz)- i. rs} Le ser rar (1/2)-{T*-65, + Sz} {A} - 2 
= (Wh) (eat (Go.r/s3)"°T*(6, — 1) — (12){(T*)?-62_ + 53}-{43-- cant. 
[19.4.40] 
But since
$$(\hat c_{0,T}/s_T^2) = (T - 2)/(T - 1) \to 1,$$
it follows that
‘ 1 
4 (hr) = Zr 
" T*-6,, = 87°” [19.A.41] 


L 1 ‘ 
a o*-(1 + hjh,)'2 (si VE) nm? 


with the last line following from [19.A.37], [19.A.38], and [19.2.37].


Proof of (d). See Phillips and Ouliaris (1990). ∎


Chapter 19 Exercises 


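19.1. Let
$$\begin{bmatrix} \Delta y_{1t} \\ \Delta y_{2t} \end{bmatrix} = \begin{bmatrix} \delta_1 \\ \delta_2 \end{bmatrix} + \begin{bmatrix} u_{1t} \\ u_{2t} \end{bmatrix},$$
where $\delta_2 \neq 0$ and $\delta_1$ may or may not be zero. Let $u_t = (u_{1t}, u_{2t})'$, and suppose that $u_t =
\Psi(L)\varepsilon_t$ for $\varepsilon_t$ an i.i.d. (2 x 1) vector with mean zero, variance $PP'$, and finite fourth
moments. Assume further that $\{s\cdot\Psi_s\}_{s=0}^\infty$ is absolutely summable and that $\Psi(1)\cdot P$ is
nonsingular. Define $\xi_{1t} \equiv \sum_{s=1}^t u_{1s}$, $\xi_{2t} \equiv \sum_{s=1}^t u_{2s}$, and $\gamma_0 \equiv \delta_1/\delta_2$.

(a) Show that the OLS estimates of
$$y_{1t} = \alpha + \gamma y_{2t} + u_t$$
satisfy
$$\begin{bmatrix} T^{-1/2}\hat\alpha_T \\ T^{1/2}(\hat\gamma_T - \gamma_0) \end{bmatrix} \overset{p}{\to} \begin{bmatrix} 1 & \delta_2/2 \\ \delta_2/2 & \delta_2^2/3 \end{bmatrix}^{-1}\begin{bmatrix} T^{-3/2}\sum(\xi_{1t} - \gamma_0\xi_{2t}) \\ T^{-5/2}\sum\delta_2 t(\xi_{1t} - \gamma_0\xi_{2t}) \end{bmatrix}.$$
Conclude that $\hat\alpha_T$ and $\hat\gamma_T$ have the same asymptotic distribution as the coefficients
from a regression of $(\xi_{1t} - \gamma_0\xi_{2t})$ on a constant and $\delta_2$ times a time trend:
$$(\xi_{1t} - \gamma_0\xi_{2t}) = \alpha + \gamma\delta_2 t + u_t.$$
(b) Show that first differences of the OLS residuals converge to
$$\Delta\hat u_t \overset{p}{\to} u_{1t} - \gamma_0 u_{2t}.$$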

19.2. Verify [19.3.23].
19.3. Verify [19.3.25]. 


19.4. Consider the regression model 
Yu = Bw, + a+ Yat St + u,, 
where 
Ww, = (AY24-p> AY2s-p+is oy Aarts AY Aare, 0 AYzr+p)'> 
Let Ay,, = n,,, where 


Hl rye fe) 0 |Pen 
[<| - #ue,- | 0 oe 


and where ¢, is i.i.d. with mean zero, finite fourth moments, and variance 


ny _ {or O jlo, 0 
a a kK ok ra 
Suppose that {s-W,}"_, is absolutely summable, A,, = o,-$4,(1) # 0, and Ag, = W,,(1)-P2» 


is nonsingular. Show that the OLS estimates satisfy 


T2(6, - B) Q th, 
T'?(@r — a) 4 Ray 
T(¥r — ¥) Av, 
T2(5, — 6) Aus 


where Q = plim T~'2w,w,, T7'?2 wu, Shy, and 


W,(1) 
Y, 
V3 


= H-1| An | [W,(r)] ano} 
H= | Ay | W,(r) dr haf | [W.(7)}-[W2(7)]’ arhit An | rW,(r) dr 


’ 


{ma - {mm ar} 


1 { | [w.(r)]’ arbi 1/2 


1/2 { i r[w.(n]’ arhit 13 


Reason as in [19.3.12] that conditional on $W_2(\cdot)$, the vector $(\nu_1, \nu_2', \nu_3)'$ is Gaussian with
mean zero and variance $H^{-1}$. Use this to show that the Wald form of the OLS $\chi^2$ test of
any m restrictions involving $\alpha$, $\gamma$, or $\delta$ converges to $(\lambda_{11}^2/s_T^2)$ times a $\chi^2(m)$ variable.


19.5. Consider the regression model ; 
Yu = Bw, + @ + Ya + u, 




where 
Wr = (AY2.-9s AYze—pris + + + A¥2r—15 AY AYare1s +++» AYeep)’- 
Suppose that 
Ay, = 8, + Ux, 


where at least one of the elements of 8, is nonzero. Let u, and u,, satisfy the same conditions 
as in Exercise 19.4. 


Let yo = (Yas Yar +++. Yu)’ and 8 = (6,, 6, ..., 6,)’, and suppose that the 
elements of y,, are ordered so that E(Ay,,) = &, # 0. Notice that the fitted values for the 
regression are identical to those of 

Yu = Bw? + a + y"y3, + 8*y,, + u,, 
where 


we = [(Ayor-p <a 8,)’, (Ayor—p41 = 5)‘, bey (AY20+p a 8,)'’ 


Yar — (8/8, )Y eu 
* = Y3 (83/8, )¥ne 


Yu 
(e-1)x1] : 

Yn-148 — (8, = 1/8) Vn 
2 

y= : 
Vn-1 

5* = Yn + Y2(52/8,) + ¥3(55/5,) + ++ + n(n —1/8,) 

a* =a + B'(1@ 5), 


with 1a [(2p + 1) x 1] column of 1s. 

Show that the asymptotic properties of the transformed regression are identical to
those of the time trend regression in Exercise 19.4. Conclude that any F test involving $\gamma$ in
the original regression can be multiplied by $(s_T^2/\lambda_{11}^2)$ and compared with the usual F tables
for an asymptotically valid test.


Chapter 19 References 


Ahn, S. K., and G. C. Reinsel. 1990. “Estimation for Partially Nonstationary Multivariate
Autoregressive Models.” Journal of the American Statistical Association 85:813-23.

Anderson, T. W. 1958. An Introduction to Multivariate Statistical Analysis. New York: Wiley.

Andrews, Donald W. K., and J. Christopher Monahan. 1992. “An Improved Heteroskedasticity
and Autocorrelation Consistent Covariance Matrix Estimator.” Econometrica 60:953-66.

Baillie, Richard T., and David D. Selover. 1987. “Cointegration and Models of Exchange
Rate Determination.” International Journal of Forecasting 3:43-51.

Campbell, John Y., and Robert J. Shiller. 1988a. “Interpreting Cointegrated Models.”
Journal of Economic Dynamics and Control 12:505-22.

——— and ———. 1988b. “The Dividend-Price Ratio and Expectations of Future Dividends
and Discount Factors.” Review of Financial Studies 1:195-228.

Clarida, Richard. 1991. “Co-Integration, Aggregate Consumption, and the Demand for
Imports: A Structural Econometric Investigation.” Columbia University. Mimeo.

Corbae, Dean, and Sam Ouliaris. 1988. “Cointegration and Tests of Purchasing Power
Parity.” Review of Economics and Statistics 70:508-11.

Davidson, James E. H., David F. Hendry, Frank Srba, and Stephen Yeo. 1978. “Econometric
Modelling of the Aggregate Time-Series Relationship between Consumers’ Expenditure
and Income in the United Kingdom.” Economic Journal 88:661-92.

Engle, Robert F., and C. W. J. Granger. 1987. “Co-Integration and Error Correction:
Representation, Estimation, and Testing.” Econometrica 55:251-76.

——— and Byung Sam Yoo. 1987. “Forecasting and Testing in Co-Integrated Systems.”
Journal of Econometrics 35:143-59.

Granger, C. W. J. 1983. “Co-Integrated Variables and Error-Correcting Models.” Unpublished
University of California, San Diego, Discussion Paper 83-13.

——— and Paul Newbold. 1974. “Spurious Regressions in Econometrics.” Journal of
Econometrics 2:111-20.

Hansen, Bruce E. 1990. “A Powerful, Simple Test for Cointegration Using Cochrane-
Orcutt.” University of Rochester. Mimeo.

———. 1992. “Efficient Estimation and Testing of Cointegrating Vectors in the Presence
of Deterministic Trends.” Journal of Econometrics 53:87-121.

Haug, Alfred A. 1992. “Critical Values for the Z_α-Phillips-Ouliaris Test for Cointegration.”
Oxford Bulletin of Economics and Statistics 54:473-80.

Johansen, Søren. 1988. “Statistical Analysis of Cointegration Vectors.” Journal of Economic
Dynamics and Control 12:231-54.

———. 1991. “Estimation and Hypothesis Testing of Cointegration Vectors in Gaussian
Vector Autoregressive Models.” Econometrica 59:1551-80.

King, Robert G., Charles I. Plosser, James H. Stock, and Mark W. Watson. 1991. “Stochastic
Trends and Economic Fluctuations.” American Economic Review 81:819-40.

Kremers, Jeroen J. M. 1989. “U.S. Federal Indebtedness and the Conduct of Fiscal Policy.”
Journal of Monetary Economics 23:219-38.

Mosconi, Rocco, and Carlo Giannini. 1992. “Non-Causality in Cointegrated Systems:
Representation, Estimation and Testing.” Oxford Bulletin of Economics and Statistics 54:399-417.

Ogaki, Masao. 1992. “Engel’s Law and Cointegration.” Journal of Political Economy
100:1027-46.

——— and Joon Y. Park. 1992. “A Cointegration Approach to Estimating Preference
Parameters.” Department of Economics, University of Rochester. Mimeo.

Park, Joon Y. 1992. “Canonical Cointegrating Regressions.” Econometrica 60:119-43.

——— and Masao Ogaki. 1991. “Inference in Cointegrated Models Using VAR Prewhitening
to Estimate Shortrun Dynamics.” University of Rochester. Mimeo.

———, S. Ouliaris, and B. Choi. 1988. “Spurious Regressions and Tests for Cointegration.”
Cornell University. Mimeo.

Phillips, Peter C. B. 1987. “Time Series Regression with a Unit Root.” Econometrica
55:277-301.

———. 1991. “Optimal Inference in Cointegrated Systems.” Econometrica 59:283-306.

——— and S. N. Durlauf. 1986. “Multiple Time Series Regression with Integrated Processes.”
Review of Economic Studies 53:473-95.

——— and Bruce E. Hansen. 1990. “Statistical Inference in Instrumental Variables Regression
with I(1) Processes.” Review of Economic Studies 57:99-125.

——— and Mico Loretan. 1991. “Estimating Long-Run Economic Equilibria.” Review of
Economic Studies 58:407-36.

——— and S. Ouliaris. 1990. “Asymptotic Properties of Residual Based Tests for Cointegration.”
Econometrica 58:165-93.

Saikkonen, Pentti. 1991. “Asymptotically Efficient Estimation of Cointegration Regressions.”
Econometric Theory 7:1-21.

Sims, Christopher A., James H. Stock, and Mark W. Watson. 1990. “Inference in Linear
Time Series Models with Some Unit Roots.” Econometrica 58:113-44.

Stock, James H. 1987. “Asymptotic Properties of Least Squares Estimators of Cointegrating
Vectors.” Econometrica 55:1035-56.

———. 1990. “A Class of Tests for Integration and Cointegration.” Harvard University.
Mimeo.

Stock, James H., and Mark W. Watson. 1988. “Testing for Common Trends.” Journal of
the American Statistical Association 83:1097-1107.

——— and ———. 1993. “A Simple Estimator of Cointegrating Vectors in Higher Order
Integrated Systems.” Econometrica 61:783-820.

Wooldridge, Jeffrey M. 1991. “Notes on Regression with Difference-Stationary Data.”
Michigan State University. Mimeo.


20 Full-Information Maximum Likelihood Analysis of Cointegrated Systems


An (n x 1) vector $y_t$ was said to exhibit h cointegrating relations if there exist h
linearly independent vectors $a_1, a_2, \ldots, a_h$ such that $a_i'y_t$ is stationary. If such
vectors exist, their values are not uniquely defined, since any linear combinations
of $a_1, a_2, \ldots, a_h$ would also be described as cointegrating vectors. The approaches
described in the previous chapter sidestepped this problem by imposing normalization
conditions such as $a_{11} = 1$. For this normalization we would put $y_{1t}$ on the
left side of a regression and the other elements of $y_t$ on the right side. We might
equally well have normalized $a_{12} = 1$ instead, in which case $y_{2t}$ would be the variable
that belongs on the left side of the regression. The results obtained in practice can
thus depend on an essentially arbitrary assumption. Furthermore, if the first variable
does not appear in the cointegrating relation at all ($a_{11} = 0$), then setting
$a_{11} = 1$ is not a harmless normalization but instead results in a fundamentally
misspecified model.

For these reasons there is some value in using full-information maximum
likelihood (FIML) to estimate the linear space spanned by the cointegrating vectors
$a_1, a_2, \ldots, a_h$. This chapter describes the solution to this problem developed by
Johansen (1988, 1991), whose work is closely related to that of Ahn and Reinsel
(1990), and more distantly to that of Stock and Watson (1988). Another advantage
of FIML is that it allows us to test for the number of cointegrating relations. The
approach of Phillips and Ouliaris (1990) described in Chapter 19 tested the null
hypothesis that there are no cointegrating relations. This chapter presents more
general tests of the null hypothesis that there are $h_0$ cointegrating relations, where
$h_0$ could be 0, 1, ..., or n - 1.

To develop these ideas, Section 20.1 begins with a discussion of canonical 
correlation analysis. Section 20.2 then develops the FIML estimates, while Section 
20.3 describes hypothesis testing in cointegrated systems. Section 20.4 offers a brief 
overview of unit roots in time series analysis. 


20.1. Canonical Correlation 


Population Canonical Correlations 


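Let the ($n_1$ x 1) vector $y_t$ and the ($n_2$ x 1) vector $x_t$ denote stationary random
variables. Typically, $y_t$ and $x_t$ are measured as deviations from their population
means, so that $E(y_ty_t')$ represents the variance-covariance matrix of $y_t$. In general,
there might be complicated correlations among the elements of $y_t$ and $x_t$, summarized
by the joint variance-covariance matrix
$$\begin{bmatrix} E(y_ty_t') & E(y_tx_t') \\ E(x_ty_t') & E(x_tx_t') \end{bmatrix} = \begin{bmatrix} \Sigma_{YY} & \Sigma_{YX} \\ \Sigma_{XY} & \Sigma_{XX} \end{bmatrix},$$
where $\Sigma_{YY}$ is ($n_1 \times n_1$), $\Sigma_{YX}$ is ($n_1 \times n_2$), $\Sigma_{XY}$ is ($n_2 \times n_1$), and $\Sigma_{XX}$ is ($n_2 \times n_2$).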

We can often gain some insight into the nature of these correlations by defining
two new (n x 1) random vectors, $\eta_t$ and $\xi_t$, where n is the smaller of $n_1$ and $n_2$.
These vectors are linear combinations of $y_t$ and $x_t$, respectively:
$$\eta_t \equiv \mathcal{K}'y_t \qquad [20.1.1]$$
$$\xi_t \equiv \mathcal{A}'x_t. \qquad [20.1.2]$$

Here $\mathcal{K}'$ and $\mathcal{A}'$ are ($n \times n_1$) and ($n \times n_2$) matrices, respectively. The matrices
$\mathcal{K}'$ and $\mathcal{A}'$ are chosen so that the following conditions hold.


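(1) The individual elements of $\eta_t$ have unit variance and are uncorrelated with
one another:
$$E(\eta_t\eta_t') = \mathcal{K}'\Sigma_{YY}\mathcal{K} = I_n. \qquad [20.1.3]$$

(2) The individual elements of $\xi_t$ have unit variance and are uncorrelated with
one another:
$$E(\xi_t\xi_t') = \mathcal{A}'\Sigma_{XX}\mathcal{A} = I_n. \qquad [20.1.4]$$

(3) The ith element of $\eta_t$ is uncorrelated with the jth element of $\xi_t$ for $i \neq j$; for
i = j, the correlation is positive and is given by $r_i$:
$$E(\xi_t\eta_t') = \mathcal{A}'\Sigma_{XY}\mathcal{K} = R, \qquad [20.1.5]$$
where
$$R = \begin{bmatrix} r_1 & 0 & \cdots & 0 \\ 0 & r_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & r_n \end{bmatrix}. \qquad [20.1.6]$$

(4) The elements of $\eta_t$ and $\xi_t$ are ordered in such a way that
$$(1 \geq r_1 \geq r_2 \geq \cdots \geq r_n \geq 0). \qquad [20.1.7]$$

The correlation $r_i$ is known as the ith population canonical correlation between
$y_t$ and $x_t$.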

The population canonical correlations and the values of $\mathcal{K}$ and $\mathcal{A}$ can be
calculated from $\Sigma_{YY}$, $\Sigma_{XX}$, and $\Sigma_{XY}$ using any computer program that generates
eigenvalues and eigenvectors, as we now describe.

Let $(\lambda_1, \lambda_2, \ldots, \lambda_{n_1})$ denote the eigenvalues of the ($n_1 \times n_1$) matrix
$$\Sigma_{YY}^{-1}\Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}, \qquad [20.1.8]$$
ordered as
$$(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_{n_1}), \qquad [20.1.9]$$
with associated eigenvectors $(k_1, k_2, \ldots, k_{n_1})$. Recall that the eigenvalue-
eigenvector pair $(\lambda_i, k_i)$ satisfies
$$\Sigma_{YY}^{-1}\Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}k_i = \lambda_ik_i. \qquad [20.1.10]$$
Notice that if $k_i$ satisfies [20.1.10], then so does $ck_i$ for any value of c. The usual
normalization convention for choosing c and thus for determining “the” eigenvector
$k_i$ to associate with $\lambda_i$ is to set $k_i'k_i = 1$. For canonical correlation analysis, however,
it is more convenient to choose c so as to ensure that
$$k_i'\Sigma_{YY}k_i = 1 \quad \text{for } i = 1, 2, \ldots, n_1. \qquad [20.1.11]$$

If a computer program has calculated eigenvectors $(\tilde k_1, \tilde k_2, \ldots, \tilde k_{n_1})$ of the matrix
in [20.1.8] normalized by $(\tilde k_i'\tilde k_i) = 1$, it is trivial to change these to eigenvectors
$(k_1, k_2, \ldots, k_{n_1})$ normalized by the condition [20.1.11] by setting
$$k_i = \tilde k_i \div \sqrt{\tilde k_i'\Sigma_{YY}\tilde k_i}.$$

We further may multiply $k_i$ by -1 so as to satisfy a certain sign convention to be
detailed in the paragraphs following the next proposition.


The canonical correlations $(r_1, r_2, \ldots, r_n)$ turn out to be given by the square
roots of the corresponding first n eigenvalues $(\lambda_1, \lambda_2, \ldots, \lambda_n)$ of [20.1.8]. The
associated ($n_1$ x 1) eigenvectors $k_1, k_2, \ldots, k_n$, when normalized by [20.1.11]
and a sign convention, turn out to make up the rows of the ($n \times n_1$) matrix $\mathcal{K}'$
appearing in [20.1.1]. The matrix $\mathcal{A}'$ in [20.1.2] can be obtained from the normalized
eigenvectors of a matrix closely related to [20.1.8]. These results are developed in
the following proposition, proved in Appendix 20.A at the end of this chapter.


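Proposition 20.1: Let
$$\Sigma \equiv \begin{bmatrix} \Sigma_{YY} & \Sigma_{YX} \\ \Sigma_{XY} & \Sigma_{XX} \end{bmatrix},$$
with $\Sigma_{YY}$ ($n_1 \times n_1$), $\Sigma_{YX}$ ($n_1 \times n_2$), $\Sigma_{XY}$ ($n_2 \times n_1$), and $\Sigma_{XX}$ ($n_2 \times n_2$),
be a positive definite symmetric $[(n_1 + n_2) \times (n_1 + n_2)]$ matrix and let $(\lambda_1, \lambda_2, \ldots, \lambda_{n_1})$ be the eigenvalues
of the matrix in [20.1.8], ordered $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_{n_1}$. Let $(k_1, k_2, \ldots, k_{n_1})$ be
the associated ($n_1$ x 1) eigenvectors as normalized by [20.1.11]. Let $(\mu_1, \mu_2, \ldots,
\mu_{n_2})$ be the eigenvalues of the ($n_2 \times n_2$) matrix
$$\Sigma_{XX}^{-1}\Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}, \qquad [20.1.12]$$
ordered $\mu_1 \geq \mu_2 \geq \cdots \geq \mu_{n_2}$. Let $(a_1, a_2, \ldots, a_{n_2})$ be the eigenvectors of
[20.1.12]:
$$\Sigma_{XX}^{-1}\Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}a_i = \mu_ia_i, \qquad [20.1.13]$$
normalized by
$$a_i'\Sigma_{XX}a_i = 1 \quad \text{for } i = 1, 2, \ldots, n_2. \qquad [20.1.14]$$

Let n be the smaller of $n_1$ and $n_2$, and collect the first n vectors $k_i$ and the first n
vectors $a_i$ in matrices
$$\mathcal{K} \equiv [k_1 \;\; k_2 \;\; \cdots \;\; k_n] \quad (n_1 \times n)$$
$$\mathcal{A} \equiv [a_1 \;\; a_2 \;\; \cdots \;\; a_n] \quad (n_2 \times n).$$

Assuming that $\lambda_1, \lambda_2, \ldots, \lambda_n$ are distinct, then

(a) $0 \leq \lambda_i < 1$ for $i = 1, 2, \ldots, n_1$ and $0 \leq \mu_j < 1$ for $j = 1, 2, \ldots, n_2$;
(b) $\lambda_i = \mu_i$ for $i = 1, 2, \ldots, n$;
(c) $\mathcal{K}'\Sigma_{YY}\mathcal{K} = I_n$ and $\mathcal{A}'\Sigma_{XX}\mathcal{A} = I_n$;
(d) $\mathcal{A}'\Sigma_{XY}\mathcal{K} = R$,

where R is a diagonal matrix whose squared diagonal elements correspond to the
eigenvalues of [20.1.8]:
$$R^2 = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{bmatrix}.$$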

If $\Sigma$ denotes the variance-covariance matrix of the vector $(y_t', x_t')'$, then results
(c) and (d) are the characterization of the canonical correlations given in [20.1.3]
through [20.1.5]. Thus, the proposition establishes that the squares of the canonical
correlations $(r_1^2, r_2^2, \ldots, r_n^2)$ can be found from the first n eigenvalues of the matrix
in [20.1.8]. Result (b) states that these are the same as the first n eigenvalues of
the matrix in [20.1.12]. The matrices $\mathcal{K}$ and $\mathcal{A}$ that characterize the canonical
variates in [20.1.1] and [20.1.2] can be found from the normalized eigenvectors of
these matrices.

The magnitude $a_i'\Sigma_{XY}k_i$ calculated by the algorithm described in Proposition
20.1 need not be positive; the proposition only ensures that its square is equal to
the square of the corresponding canonical correlation. If $a_i'\Sigma_{XY}k_i < 0$ for some i,
one can replace $k_i$ as calculated with $-k_i$, so that the ith diagonal element of R
will correspond to the positive square root of $\lambda_i$.
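
The calculation described in Proposition 20.1 can be sketched numerically as follows. This is only an illustrative sketch, not part of the original text: the joint covariance matrix `Sigma` is a made-up positive definite matrix, the function names `sorted_eig` and `canonical_correlations` are hypothetical, and the code assumes the eigenvalues are distinct so that the eigenvectors are determined up to scale and sign.

```python
import numpy as np

def sorted_eig(M):
    """Eigenvalues (descending) and matching eigenvectors of M.

    The matrices in [20.1.8] and [20.1.12] have real eigenvalues in theory,
    so only the real parts are kept (an assumption of this sketch).
    """
    vals, vecs = np.linalg.eig(M)
    order = np.argsort(vals.real)[::-1]
    return vals.real[order], vecs.real[:, order]

def canonical_correlations(S_yy, S_yx, S_xx):
    """Return (r, lam, mu, K, A) following the steps of Proposition 20.1."""
    S_xy = S_yx.T
    n = min(S_yy.shape[0], S_xx.shape[0])
    # Matrix [20.1.8] and matrix [20.1.12].
    lam, K = sorted_eig(np.linalg.solve(S_yy, S_yx) @ np.linalg.solve(S_xx, S_xy))
    mu, A = sorted_eig(np.linalg.solve(S_xx, S_xy) @ np.linalg.solve(S_yy, S_yx))
    K, A = K[:, :n], A[:, :n]
    # Rescale unit-length eigenvectors to satisfy [20.1.11] and [20.1.14].
    K = K / np.sqrt(np.diag(K.T @ S_yy @ K))
    A = A / np.sqrt(np.diag(A.T @ S_xx @ A))
    # Sign convention: replace k_i with -k_i when a_i' S_xy k_i < 0.
    flip = np.diag(A.T @ S_xy @ K) < 0
    K[:, flip] *= -1
    r = np.sqrt(np.clip(lam[:n], 0.0, None))
    return r, lam, mu, K, A

# Hypothetical population moments with n1 = 2 and n2 = 3.
rng = np.random.default_rng(0)
G = rng.standard_normal((5, 5))
Sigma = G @ G.T + 5.0 * np.eye(5)
n1 = 2
S_yy, S_yx, S_xx = Sigma[:n1, :n1], Sigma[:n1, n1:], Sigma[n1:, n1:]
r, lam, mu, K, A = canonical_correlations(S_yy, S_yx, S_xx)
```

Under these assumptions, `K.T @ S_yy @ K` and `A.T @ S_xx @ A` should reproduce the identity matrices in result (c), and `A.T @ S_yx.T @ K` should be the diagonal matrix R of result (d).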

As an illustration, suppose that $y_t$ consists of a single variable ($n_1 = n = 1$).
In this case, the matrix [20.1.8] is just a scalar, a (1 x 1) “matrix” that is equal
to its own eigenvalue. Thus, the squared population canonical correlation between
a scalar $y_t$ and a set of $n_2$ explanatory variables $x_t$ is given by
$$r_1^2 = \frac{\Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}}{\Sigma_{YY}}.$$

To interpret this expression, recall from equation [4.1.15] that the mean squared
error of a linear projection of $y_t$ on $x_t$ is given by
$$MSE = \Sigma_{YY} - \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY},$$
and so
$$1 - r_1^2 = \frac{MSE}{\Sigma_{YY}}. \qquad [20.1.15]$$

Thus, for this simple case, $r_1^2$ is the fraction of the population variance that is
explained by the linear projection; that is, $r_1^2$ is the population squared multiple
correlation coefficient, commonly denoted $R^2$.

Another interpretation of canonical correlations is also sometimes helpful.
The first canonical variates $\eta_{1t}$ and $\xi_{1t}$ can be interpreted as those linear combinations
of $y_t$ and $x_t$, respectively, such that the correlation between $\eta_{1t}$ and $\xi_{1t}$ is
as large as possible (see Exercise 20.1). The variates $\eta_{2t}$ and $\xi_{2t}$ give those linear
combinations of $y_t$ and $x_t$ that are uncorrelated with $\eta_{1t}$ and $\xi_{1t}$ and yet yield the
largest remaining correlation between $\eta_{2t}$ and $\xi_{2t}$, and so on.


Sample Canonical Correlations 


The canonical correlations $r_i$ calculated by the procedure just described are
population parameters; they are functions of the population moments $\Sigma_{YY}$, $\Sigma_{YX}$,
and $\Sigma_{XX}$. Here we describe their sample analogs, to be denoted $\hat r_i$.

Suppose we have a sample of T observations on the ($n_1$ x 1) vector $y_t$ and
the ($n_2$ x 1) vector $x_t$, whose sample moments are given by
$$\hat\Sigma_{YY} \equiv (1/T)\sum_{t=1}^T y_ty_t' \qquad [20.1.16]$$
$$\hat\Sigma_{YX} \equiv (1/T)\sum_{t=1}^T y_tx_t' \qquad [20.1.17]$$
$$\hat\Sigma_{XX} \equiv (1/T)\sum_{t=1}^T x_tx_t'. \qquad [20.1.18]$$

Again, in many applications, y, and x, would be measured in deviations from their 
sample means. 

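To calculate sample canonical correlations, the objective is to generate a set
of T observations on a new (n x 1) vector $\hat\eta_t$, where n is the smaller of $n_1$ and $n_2$.
The vector $\hat\eta_t$ is a linear combination of the observed value of $y_t$:
$$\hat\eta_t \equiv \hat{\mathcal{K}}'y_t \qquad [20.1.19]$$
for $\hat{\mathcal{K}}$ an ($n_1 \times n$) matrix to be estimated from the data. The task will be to choose
$\hat{\mathcal{K}}$ so that the ith generated series ($\hat\eta_{it}$) has unit sample variance and is orthogonal
to the jth generated series:
$$(1/T)\sum_{t=1}^T \hat\eta_t\hat\eta_t' = I_n. \qquad [20.1.20]$$

Similarly, we will generate an (n x 1) vector $\hat\xi_t$ from the elements of $x_t$:
$$\hat\xi_t \equiv \hat{\mathcal{A}}'x_t. \qquad [20.1.21]$$
Each of the variables $\hat\xi_{it}$ has unit sample variance and is orthogonal to $\hat\xi_{jt}$ for
$i \neq j$:
$$(1/T)\sum_{t=1}^T \hat\xi_t\hat\xi_t' = I_n. \qquad [20.1.22]$$

Finally, $\hat\eta_{it}$ is orthogonal to $\hat\xi_{jt}$ for $i \neq j$, while the sample correlation between
$\hat\eta_{it}$ and $\hat\xi_{it}$ is called the sample canonical correlation coefficient:
$$(1/T)\sum_{t=1}^T \hat\xi_t\hat\eta_t' = \hat R \qquad [20.1.23]$$
for
$$\hat R = \begin{bmatrix} \hat r_1 & 0 & \cdots & 0 \\ 0 & \hat r_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \hat r_n \end{bmatrix}. \qquad [20.1.24]$$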


Finding matrices $\hat{\mathcal{K}}$, $\hat{\mathcal{A}}$, and $\hat R$ satisfying [20.1.20], [20.1.22], and [20.1.23]
involves exactly the same calculations as did finding matrices $\mathcal{K}$, $\mathcal{A}$, and R satisfying
[20.1.3] through [20.1.5]. For example, [20.1.19] allows us to write [20.1.20] as
$$I_n = (1/T)\sum_{t=1}^T \hat\eta_t\hat\eta_t' = \hat{\mathcal{K}}'(1/T)\sum_{t=1}^T y_ty_t'\hat{\mathcal{K}} = \hat{\mathcal{K}}'\hat\Sigma_{YY}\hat{\mathcal{K}}, \qquad [20.1.25]$$
where the last line follows from [20.1.16]. Expression [20.1.25] is identical to
[20.1.3] with hats placed over the variables. Similarly, substituting [20.1.21] into
[20.1.22] gives $\hat{\mathcal{A}}'\hat\Sigma_{XX}\hat{\mathcal{A}} = I_n$, which corresponds to [20.1.4]. Equation [20.1.23]
becomes $\hat{\mathcal{A}}'\hat\Sigma_{XY}\hat{\mathcal{K}} = \hat R$, as in [20.1.5]. Again, we can replace $\hat k_i$ with $-\hat k_i$ if any
of the elements of $\hat R$ should turn out negative.

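Thus, to calculate the sample canonical correlations, the procedure described
in Proposition 20.1 is simply applied to the sample moments ($\hat\Sigma_{YY}$, $\hat\Sigma_{YX}$, and
$\hat\Sigma_{XX}$) rather than to the population moments. In particular, the square of the ith
sample canonical correlation ($\hat r_i^2$) is given by the ith largest eigenvalue of the matrix
$$\hat\Sigma_{YY}^{-1}\hat\Sigma_{YX}\hat\Sigma_{XX}^{-1}\hat\Sigma_{XY} = \left\{(1/T)\sum_{t=1}^T y_ty_t'\right\}^{-1}\left\{(1/T)\sum_{t=1}^T y_tx_t'\right\}\left\{(1/T)\sum_{t=1}^T x_tx_t'\right\}^{-1}\left\{(1/T)\sum_{t=1}^T x_ty_t'\right\}. \qquad [20.1.26]$$

The ith column of $\hat{\mathcal{K}}$ is given by the eigenvector associated with this ith eigenvalue,
normalized so that
$$\hat k_i'\left\{(1/T)\sum_{t=1}^T y_ty_t'\right\}\hat k_i = 1.$$

The ith column of $\hat{\mathcal{A}}$ is given by the eigenvector associated with the eigenvalue
$\hat\lambda_i$ for the matrix $\hat\Sigma_{XX}^{-1}\hat\Sigma_{XY}\hat\Sigma_{YY}^{-1}\hat\Sigma_{YX}$, normalized by the condition that
$\hat a_i'\hat\Sigma_{XX}\hat a_i = 1$.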

For example, suppose that $y_t$ is a scalar ($n = n_1 = 1$). Then [20.1.26] is a
scalar equal to its own eigenvalue. Hence, the sample squared canonical correlation
between the scalar $y_t$ and a set of $n_2$ explanatory variables $x_t$ is given by
$$\hat r_1^2 = \frac{\{T^{-1}\sum y_tx_t'\}\{T^{-1}\sum x_tx_t'\}^{-1}\{T^{-1}\sum x_ty_t\}}{\{T^{-1}\sum y_t^2\}} = \frac{\{\sum y_tx_t'\}\{\sum x_tx_t'\}^{-1}\{\sum x_ty_t\}}{\{\sum y_t^2\}},$$
which is just the squared sample multiple correlation coefficient $R^2$.
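
As a quick numerical check of this scalar case, the following sketch compares the sample squared canonical correlation with the $R^2$ from an OLS regression. The data are simulated purely for illustration (the coefficients and seed are arbitrary assumptions), and both series are measured in deviations from their sample means, as the text suggests.

```python
import numpy as np

# Simulated data: scalar y_t and n2 = 3 explanatory variables (illustrative only).
rng = np.random.default_rng(1)
T, n2 = 200, 3
X = rng.standard_normal((T, n2))
y = X @ np.array([0.5, -1.0, 0.25]) + rng.standard_normal(T)

# Measure both in deviations from sample means.
X = X - X.mean(axis=0)
y = y - y.mean()

# Sample moments [20.1.16]-[20.1.18] for the scalar-y case.
S_yy = y @ y / T
S_yx = y @ X / T
S_xx = X.T @ X / T

# Squared sample canonical correlation: the (1 x 1) version of [20.1.26].
r1_sq = S_yx @ np.linalg.solve(S_xx, S_yx) / S_yy

# R^2 from an OLS regression of y on x (no constant needed after demeaning).
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
R2 = 1.0 - (resid @ resid) / (y @ y)
```

The two quantities `r1_sq` and `R2` should agree up to floating-point error.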


20.2. Maximum Likelihood Estimation 


We are now in a position to describe Johansen’s approach (1988, 1991) to full- 
information maximum likelihood estimation of a system characterized by exactly 
h cointegrating relations. 

Let $y_t$ denote an (n x 1) vector. The maintained hypothesis is that $y_t$ follows
a VAR(p) in levels. Recall from equation [19.1.39] that any pth-order VAR can
be written in the form
$$\Delta y_t = \zeta_1\Delta y_{t-1} + \zeta_2\Delta y_{t-2} + \cdots + \zeta_{p-1}\Delta y_{t-p+1} + \alpha + \zeta_0y_{t-1} + \varepsilon_t, \qquad [20.2.1]$$
with
$$E(\varepsilon_t) = 0$$
$$E(\varepsilon_t\varepsilon_\tau') = \begin{cases} \Omega & \text{for } t = \tau \\ 0 & \text{otherwise.} \end{cases}$$

Suppose that each individual variable $y_{it}$ is I(1), although h linear combinations of
$y_t$ are stationary. We saw in equations [19.1.35] and [19.1.40] that this implies that
$\zeta_0$ can be written in the form
$$\zeta_0 = -BA' \qquad [20.2.2]$$
for B an (n x h) matrix and $A'$ an (h x n) matrix. That is, under the hypothesis
of h cointegrating relations, only h separate linear combinations of the level of $y_{t-1}$
(the h elements of $z_{t-1} = A'y_{t-1}$) appear in [20.2.1].

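Consider a sample of T + p observations on y, denoted $(y_{-p+1}, y_{-p+2}, \ldots,
y_T)$. If the disturbances $\varepsilon_t$ are Gaussian, then the log likelihood of $(y_1, y_2, \ldots,
y_T)$ conditional on $(y_{-p+1}, y_{-p+2}, \ldots, y_0)$ is given by
$$\mathcal{L}(\Omega, \zeta_1, \zeta_2, \ldots, \zeta_{p-1}, \alpha, \zeta_0) = (-Tn/2)\log(2\pi) - (T/2)\log|\Omega|$$
$$\quad - (1/2)\sum_{t=1}^T \left[(\Delta y_t - \zeta_1\Delta y_{t-1} - \zeta_2\Delta y_{t-2} - \cdots - \zeta_{p-1}\Delta y_{t-p+1} - \alpha - \zeta_0y_{t-1})'\right.$$
$$\left.\quad \times\, \Omega^{-1}(\Delta y_t - \zeta_1\Delta y_{t-1} - \zeta_2\Delta y_{t-2} - \cdots - \zeta_{p-1}\Delta y_{t-p+1} - \alpha - \zeta_0y_{t-1})\right]. \qquad [20.2.3]$$

The goal is to choose $(\Omega, \zeta_1, \zeta_2, \ldots, \zeta_{p-1}, \alpha, \zeta_0)$ so as to maximize [20.2.3] subject
to the constraint that $\zeta_0$ can be written in the form of [20.2.2].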

We will first summarize Johansen’s algorithm, and then verify that it indeed 
calculates the maximum likelihood estimates. 


Step 1: Calculate Auxiliary Regressions 


The first step is to estimate a (p - 1)th-order VAR for $\Delta y_t$; that is, regress
the scalar $\Delta y_{it}$ on a constant and all the elements of the vectors $\Delta y_{t-1}, \Delta y_{t-2}, \ldots,
\Delta y_{t-p+1}$ by OLS. Collect the i = 1, 2, ..., n OLS regressions in vector form as
$$\Delta y_t = \hat\pi_0 + \hat\Pi_1\Delta y_{t-1} + \hat\Pi_2\Delta y_{t-2} + \cdots + \hat\Pi_{p-1}\Delta y_{t-p+1} + \hat u_t, \qquad [20.2.4]$$
where $\hat\Pi_i$ denotes an (n x n) matrix of OLS coefficient estimates and $\hat u_t$ denotes
the (n x 1) vector of OLS residuals. We also estimate a second battery of regressions,
regressing the scalar $y_{i,t-1}$ on a constant and $\Delta y_{t-1}, \Delta y_{t-2}, \ldots, \Delta y_{t-p+1}$
for i = 1, 2, ..., n. Write this second set of OLS regressions as¹
$$y_{t-1} = \hat\theta + \hat\aleph_1\Delta y_{t-1} + \hat\aleph_2\Delta y_{t-2} + \cdots + \hat\aleph_{p-1}\Delta y_{t-p+1} + \hat v_t, \qquad [20.2.5]$$
with $\hat v_t$ the (n x 1) vector of residuals from this second battery of regressions.

¹Johansen (1991) described his procedure as calculating $\tilde v_t$ in place of $\hat v_t$, where $\tilde v_t$ is the OLS residual
from a regression of $y_{t-p}$ on a constant and $\Delta y_{t-1}, \Delta y_{t-2}, \ldots, \Delta y_{t-p+1}$. Since $y_{t-p} = y_{t-1} - \Delta y_{t-1} -
\Delta y_{t-2} - \cdots - \Delta y_{t-p+1}$, the residual $\tilde v_t$ is numerically identical to $\hat v_t$ described in the text.
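
The two batteries of auxiliary regressions can be sketched in a few lines of NumPy. This is an illustrative sketch rather than a reference implementation: the function name `auxiliary_residuals` and the layout of the input array (rows ordered from date -p+1 to date T) are assumptions made here, and the data are a simulated random walk.

```python
import numpy as np

def auxiliary_residuals(y, p):
    """Residuals u_hat ([20.2.4]) and v_hat ([20.2.5]).

    y : array of shape (T + p, n), rows ordered from date -p+1 to date T
        (a layout assumed for this sketch).
    Returns (U, V, Z): U and V are (T x n) residual matrices and Z is the
    common regressor matrix [1, dy_{t-1}, ..., dy_{t-p+1}].
    """
    y = np.asarray(y, dtype=float)
    T = y.shape[0] - p
    dy = np.diff(y, axis=0)              # dy[k] = y[k+1] - y[k]
    rows = np.arange(p - 1, p - 1 + T)   # position of Delta y_t in dy, t = 1..T
    lhs_diff = dy[rows]                  # Delta y_t
    lhs_level = y[rows]                  # y_{t-1}
    lags = [dy[rows - j] for j in range(1, p)]  # Delta y_{t-1}, ..., Delta y_{t-p+1}
    Z = np.hstack([np.ones((T, 1))] + lags)
    coef_u, *_ = np.linalg.lstsq(Z, lhs_diff, rcond=None)
    coef_v, *_ = np.linalg.lstsq(Z, lhs_level, rcond=None)
    U = lhs_diff - Z @ coef_u
    V = lhs_level - Z @ coef_v
    return U, V, Z

# Simulated bivariate random walk, for illustration only.
rng = np.random.default_rng(2)
p, T, n = 3, 100, 2
y = np.cumsum(rng.standard_normal((T + p, n)), axis=0)
U, V, Z = auxiliary_residuals(y, p)
```

Because every regression shares the same regressors, one least-squares call per battery handles all n equations at once. Regressing $y_{t-p}$ (here `y[:T]`) on the same `Z` should give residuals identical to `V`, which is the point of the footnote.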




Step 2: Calculate Canonical Correlations 


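Next calculate the sample variance-covariance matrices of the OLS residuals
$\hat u_t$ and $\hat v_t$:
$$\hat\Sigma_{VV} \equiv (1/T)\sum_{t=1}^T \hat v_t\hat v_t' \qquad [20.2.6]$$
$$\hat\Sigma_{UU} \equiv (1/T)\sum_{t=1}^T \hat u_t\hat u_t' \qquad [20.2.7]$$
$$\hat\Sigma_{UV} \equiv (1/T)\sum_{t=1}^T \hat u_t\hat v_t' \qquad [20.2.8]$$
$$\hat\Sigma_{VU} \equiv \hat\Sigma_{UV}'.$$

From these, find the eigenvalues of the matrix
$$\hat\Sigma_{VV}^{-1}\hat\Sigma_{VU}\hat\Sigma_{UU}^{-1}\hat\Sigma_{UV} \qquad [20.2.9]$$
with the eigenvalues ordered $\hat\lambda_1 > \hat\lambda_2 > \cdots > \hat\lambda_n$. The maximum value attained
by the log likelihood function subject to the constraint that there are h cointegrating
relations is given by
$$\mathcal{L}^* = -(Tn/2)\log(2\pi) - (Tn/2) - (T/2)\log|\hat\Sigma_{UU}| - (T/2)\sum_{i=1}^h \log(1 - \hat\lambda_i). \qquad [20.2.10]$$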
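
Step 2 can be sketched as follows, taking the two residual matrices as given. The function names `johansen_eigenvalues` and `max_loglik` are hypothetical, and the inputs `U` and `V` below are simulated placeholders standing in for $\hat u_t$ and $\hat v_t$ (stacked as rows), not residuals from an actual VAR.

```python
import numpy as np

def johansen_eigenvalues(U, V):
    """Eigenvalues of [20.2.9], sorted in descending order, plus Sigma_UU."""
    T = U.shape[0]
    S_vv = V.T @ V / T          # [20.2.6]
    S_uu = U.T @ U / T          # [20.2.7]
    S_uv = U.T @ V / T          # [20.2.8]
    M = np.linalg.solve(S_vv, S_uv.T) @ np.linalg.solve(S_uu, S_uv)
    # Eigenvalues are real in theory; discard round-off imaginary parts.
    lam = np.sort(np.linalg.eigvals(M).real)[::-1]
    return lam, S_uu

def max_loglik(U, V, h):
    """Maximized log likelihood [20.2.10] under h cointegrating relations."""
    T, n = U.shape
    lam, S_uu = johansen_eigenvalues(U, V)
    _, logdet_uu = np.linalg.slogdet(S_uu)
    return (-(T * n / 2) * np.log(2 * np.pi) - (T * n / 2)
            - (T / 2) * logdet_uu - (T / 2) * np.sum(np.log(1 - lam[:h])))

# Placeholder residual matrices (illustrative only).
rng = np.random.default_rng(3)
T, n = 250, 3
V = rng.standard_normal((T, n))
U = 0.5 * V @ rng.standard_normal((n, n)) + rng.standard_normal((T, n))
lam, S_uu = johansen_eigenvalues(U, V)
loglik_by_h = [max_loglik(U, V, h) for h in range(n + 1)]
```

Since each $\log(1 - \hat\lambda_i)$ is nonpositive, the values in `loglik_by_h` cannot decrease as h grows. With h = n the value should match the unrestricted Gaussian log likelihood from regressing $\hat u_t$ on $\hat v_t$.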


Step 3: Calculate Maximum Likelihood Estimates 
of Parameters 


If we are interested only in a likelihood ratio test of the number of cointegrating
relations, step 2 provides all the information needed. If maximum likelihood
estimates of parameters are also desired, these can be calculated as follows. Let
$\hat a_1, \hat a_2, \ldots, \hat a_h$ denote the (n x 1) eigenvectors of [20.2.9] associated with the h
largest eigenvalues. These provide a basis for the space of cointegrating relations;
that is, the maximum likelihood estimate is that any cointegrating vector can be
written in the form
$$a = b_1\hat a_1 + b_2\hat a_2 + \cdots + b_h\hat a_h$$
for some choice of scalars $(b_1, b_2, \ldots, b_h)$. Johansen suggested normalizing these
vectors $\hat a_i$ so that $\hat a_i'\hat\Sigma_{VV}\hat a_i = 1$. For example, if the eigenvectors $\tilde a_i$ of [20.2.9] are
calculated from a standard computer program that normalizes $\tilde a_i'\tilde a_i = 1$, Johansen's
estimate is $\hat a_i = \tilde a_i \div \sqrt{\tilde a_i'\hat\Sigma_{VV}\tilde a_i}$. Collect the first h normalized vectors in an
(n x h) matrix $\hat A$:
$$\hat A \equiv [\hat a_1 \;\; \hat a_2 \;\; \cdots \;\; \hat a_h]. \qquad [20.2.11]$$
Then the MLE of $\zeta_0$ is given by
$$\hat\zeta_0 = \hat\Sigma_{UV}\hat A\hat A'. \qquad [20.2.12]$$
The MLE of $\zeta_i$ for i = 1, 2, ..., p - 1 is
$$\hat\zeta_i = \hat\Pi_i - \hat\zeta_0\hat\aleph_i, \qquad [20.2.13]$$
and the MLE of $\alpha$ is
$$\hat\alpha = \hat\pi_0 - \hat\zeta_0\hat\theta. \qquad [20.2.14]$$

The MLE of $\Omega$ is
$$\hat\Omega = (1/T)\sum_{t=1}^T [(\hat u_t - \hat\zeta_0\hat v_t)(\hat u_t - \hat\zeta_0\hat v_t)']. \qquad [20.2.15]$$
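
A minimal sketch of the step 3 formulas for $\hat A$, $\hat\zeta_0$, and $\hat\Omega$ follows. The function name `johansen_mle` is hypothetical, the residual matrices are simulated placeholders, and the estimates [20.2.13] and [20.2.14] are omitted because they need the step 1 coefficient matrices, which this sketch does not carry along.

```python
import numpy as np

def johansen_mle(U, V, h):
    """Return (lam, A_hat, zeta0_hat, Omega_hat) from residual matrices U, V."""
    T = U.shape[0]
    S_vv = V.T @ V / T
    S_uu = U.T @ U / T
    S_uv = U.T @ V / T
    M = np.linalg.solve(S_vv, S_uv.T) @ np.linalg.solve(S_uu, S_uv)   # [20.2.9]
    vals, vecs = np.linalg.eig(M)
    order = np.argsort(vals.real)[::-1]
    lam = vals.real[order]
    A_tilde = vecs.real[:, order[:h]]            # unit-length eigenvectors
    # Johansen's normalization: a_i' Sigma_VV a_i = 1.
    A_hat = A_tilde / np.sqrt(np.diag(A_tilde.T @ S_vv @ A_tilde))
    zeta0_hat = S_uv @ A_hat @ A_hat.T           # [20.2.12]
    E = U - V @ zeta0_hat.T                      # rows are (u_t - zeta0 v_t)'
    Omega_hat = E.T @ E / T                      # [20.2.15]
    return lam, A_hat, zeta0_hat, Omega_hat

# Placeholder residual matrices (illustrative only).
rng = np.random.default_rng(4)
T, n, h = 300, 3, 2
V = rng.standard_normal((T, n))
U = 0.4 * V @ rng.standard_normal((n, n)) + rng.standard_normal((T, n))
lam, A_hat, zeta0_hat, Omega_hat = johansen_mle(U, V, h)
```

One useful consistency check is that $|\hat\Omega|$ should equal $|\hat\Sigma_{UU}|\prod_{i=1}^h(1 - \hat\lambda_i)$, which is what makes [20.2.10] the maximized value of the log likelihood. The matrix `zeta0_hat` should also have rank h.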


We now review the logic behind each of these steps in turn. 


Motivation for Auxiliary Regressions 


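The first step involves concentrating the likelihood function.² This means
taking $\Omega$ and $\zeta_0$ as given and maximizing [20.2.3] with respect to $(\alpha, \zeta_1, \zeta_2, \ldots,
\zeta_{p-1})$. This restricted maximization problem takes the form of seemingly unrelated
regressions of the elements of the (n x 1) vector $\Delta y_t - \zeta_0y_{t-1}$ on a constant and
the explanatory variables $(\Delta y_{t-1}, \Delta y_{t-2}, \ldots, \Delta y_{t-p+1})$. Since each of the n regressions
in this system has the identical explanatory variables, the estimates of $(\alpha, \zeta_1,
\zeta_2, \ldots, \zeta_{p-1})$ would come from OLS regressions of each of the elements of
$\Delta y_t - \zeta_0y_{t-1}$ on a constant and $(\Delta y_{t-1}, \Delta y_{t-2}, \ldots, \Delta y_{t-p+1})$. Denote the values
of $(\alpha, \zeta_1, \zeta_2, \ldots, \zeta_{p-1})$ that maximize [20.2.3] for a given value of $\zeta_0$ by
$$[\hat\alpha^*(\zeta_0), \hat\zeta_1^*(\zeta_0), \hat\zeta_2^*(\zeta_0), \ldots, \hat\zeta_{p-1}^*(\zeta_0)].$$
These values are characterized by the condition that the following residual vector
must have sample mean zero and be orthogonal to $\Delta y_{t-1}, \Delta y_{t-2}, \ldots, \Delta y_{t-p+1}$:
$$[\Delta y_t - \zeta_0y_{t-1}] - \{\hat\alpha^*(\zeta_0) + \hat\zeta_1^*(\zeta_0)\Delta y_{t-1} + \hat\zeta_2^*(\zeta_0)\Delta y_{t-2} + \cdots + \hat\zeta_{p-1}^*(\zeta_0)\Delta y_{t-p+1}\}. \qquad [20.2.16]$$

But notice that the OLS residuals $\hat u_t$ in [20.2.4] and $\hat v_t$ in [20.2.5] each satisfy this
orthogonality requirement, and therefore the vector $\hat u_t - \zeta_0\hat v_t$ also has sample mean
zero and is orthogonal to $\Delta y_{t-1}, \Delta y_{t-2}, \ldots, \Delta y_{t-p+1}$. Moreover, $\hat u_t - \zeta_0\hat v_t$ is of
the form of expression [20.2.16],
$$\hat u_t - \zeta_0\hat v_t = (\Delta y_t - \hat\pi_0 - \hat\Pi_1\Delta y_{t-1} - \hat\Pi_2\Delta y_{t-2} - \cdots - \hat\Pi_{p-1}\Delta y_{t-p+1})$$
$$\quad - \zeta_0(y_{t-1} - \hat\theta - \hat\aleph_1\Delta y_{t-1} - \hat\aleph_2\Delta y_{t-2} - \cdots - \hat\aleph_{p-1}\Delta y_{t-p+1}),$$
with
$$\hat\alpha^*(\zeta_0) = \hat\pi_0 - \zeta_0\hat\theta \qquad [20.2.17]$$
$$\hat\zeta_i^*(\zeta_0) = \hat\Pi_i - \zeta_0\hat\aleph_i \quad \text{for } i = 1, 2, \ldots, p - 1. \qquad [20.2.18]$$

Thus, the vector in [20.2.16] is given by $\hat u_t - \zeta_0\hat v_t$.

The concentrated log likelihood function (to be denoted $\mathcal{M}$) is found by
replacing $(\alpha, \zeta_1, \zeta_2, \ldots, \zeta_{p-1})$ in [20.2.3] with $[\hat\alpha^*(\zeta_0), \hat\zeta_1^*(\zeta_0), \hat\zeta_2^*(\zeta_0), \ldots,
\hat\zeta_{p-1}^*(\zeta_0)]$:
$$\mathcal{M}(\Omega, \zeta_0) \equiv \mathcal{L}\{\Omega, \hat\zeta_1^*(\zeta_0), \hat\zeta_2^*(\zeta_0), \ldots, \hat\zeta_{p-1}^*(\zeta_0), \hat\alpha^*(\zeta_0), \zeta_0\}$$
$$= -(Tn/2)\log(2\pi) - (T/2)\log|\Omega| - (1/2)\sum_{t=1}^T [(\hat u_t - \zeta_0\hat v_t)'\Omega^{-1}(\hat u_t - \zeta_0\hat v_t)]. \qquad [20.2.19]$$
The idea behind concentrating the likelihood function in this way is that if we can
find the values of $\Omega$ and $\zeta_0$ for which $\mathcal{M}$ is maximized, then these same values
(along with $\hat\alpha^*(\zeta_0)$ and $\hat\zeta_i^*(\zeta_0)$) will maximize [20.2.3].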

²See Koopmans and Hood (1953, pp. 156-58) for more background on concentration of likelihood
functions. 


638 Chapter 20 | Maximum Likelihood Analysis of Cointegrated Systems 


Continuing the concentration one step further, recall from the analysis of
[11.1.25] that the value of $\Omega$ that maximizes [20.2.19] (still regarding $\zeta_0$ as fixed)
is given by

$$\hat{\Omega}^*(\zeta_0) = (1/T)\sum_{t=1}^{T}\left[(\hat{u}_t - \zeta_0\hat{v}_t)(\hat{u}_t - \zeta_0\hat{v}_t)'\right]. \tag*{[20.2.20]}$$

As in expression [11.1.32], the value obtained for [20.2.19] when evaluated at
[20.2.20] is then

$$N(\zeta_0) = M\{\hat{\Omega}^*(\zeta_0), \zeta_0\}$$

$$= -(Tn/2)\log(2\pi) - (T/2)\log|\hat{\Omega}^*(\zeta_0)| - (Tn/2)$$

$$= -(Tn/2)\log(2\pi) - (Tn/2) - (T/2)\log\left|(1/T)\sum_{t=1}^{T}\left[(\hat{u}_t - \zeta_0\hat{v}_t)(\hat{u}_t - \zeta_0\hat{v}_t)'\right]\right|. \tag*{[20.2.21]}$$

Expression [20.2.21] represents the biggest value one can achieve for the log
likelihood for any given value of $\zeta_0$. Maximizing the likelihood function thus comes
down to choosing $\zeta_0$ so as to minimize

$$\left|(1/T)\sum_{t=1}^{T}\left[(\hat{u}_t - \zeta_0\hat{v}_t)(\hat{u}_t - \zeta_0\hat{v}_t)'\right]\right| \tag*{[20.2.22]}$$

subject to the constraint of [20.2.2].
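As an illustration of [20.2.20] and [20.2.21], the sketch below evaluates $N(\zeta_0)$ for residual arrays stored with one row per date. The residuals are simulated and the function name `concentrated_loglik` is hypothetical. Without the rank constraint [20.2.2], the determinant in [20.2.22] is minimized by the OLS coefficient from regressing $\hat{u}_t$ on $\hat{v}_t$, which the last line uses as a sanity check.

```python
import numpy as np

def concentrated_loglik(u_hat, v_hat, zeta0):
    """N(zeta0) in [20.2.21] for residual arrays with one row per date."""
    T, n = u_hat.shape
    e = u_hat - v_hat @ zeta0.T              # rows are (u_t - zeta0 v_t)'
    omega = e.T @ e / T                      # Omega-hat*(zeta0), [20.2.20]
    _, logdet = np.linalg.slogdet(omega)
    return -(T * n / 2) * np.log(2 * np.pi) - T * n / 2 - (T / 2) * logdet

# Hypothetical residuals, used only to exercise the function.
rng = np.random.default_rng(1)
v_hat = rng.standard_normal((150, 2))
u_hat = 0.3 * v_hat + rng.standard_normal((150, 2))

# Unrestricted OLS value of zeta0 (no rank constraint imposed).
zeta_ols = np.linalg.solve(v_hat.T @ v_hat, v_hat.T @ u_hat).T
n_ols = concentrated_loglik(u_hat, v_hat, zeta_ols)
n_zero = concentrated_loglik(u_hat, v_hat, np.zeros((2, 2)))
print(n_ols >= n_zero)
```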


Motivation for Canonical Correlation Analysis 


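To see the motivation for calculating canonical correlations, consider first a
simpler problem. Suppose that by an astounding coincidence, $\hat{u}_t$ and $\hat{v}_t$ were already
in canonical form,

$$\hat{u}_t = \hat{\eta}_t$$

$$\hat{v}_t = \hat{\xi}_t,$$

with

$$(1/T)\sum_{t=1}^{T}\hat{\eta}_t\hat{\eta}_t' = I_n \tag*{[20.2.23]}$$

$$(1/T)\sum_{t=1}^{T}\hat{\xi}_t\hat{\xi}_t' = I_n \tag*{[20.2.24]}$$

$$(1/T)\sum_{t=1}^{T}\hat{\xi}_t\hat{\eta}_t' = \hat{R} \tag*{[20.2.25]}$$

$$\hat{R} = \begin{bmatrix} \hat{r}_1 & 0 & \cdots & 0 \\ 0 & \hat{r}_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \hat{r}_n \end{bmatrix}. \tag*{[20.2.26]}$$

Suppose that for these canonical data we were asked to choose $\zeta_0$ so as to minimize

$$\left|(1/T)\sum_{t=1}^{T}\left[(\hat{\eta}_t - \zeta_0\hat{\xi}_t)(\hat{\eta}_t - \zeta_0\hat{\xi}_t)'\right]\right| \tag*{[20.2.27]}$$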


20.2. Maximum Likelihood Estimation 639 


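subject to the constraint that $\zeta_0\hat{\xi}_t$ could make use of only $h$ linear combinations of
$\hat{\xi}_t$. If there were no restrictions on $\zeta_0$ (so that $h = n$), then expression [20.2.27]
would be minimized by OLS regressions of $\hat{\eta}_{it}$ on $\hat{\xi}_t$ for $i = 1, 2, \ldots, n$. Conditions
[20.2.24] and [20.2.25] establish that the $i$th regression would have an estimated
coefficient vector of

$$\left\{(1/T)\sum_{t=1}^{T}\hat{\xi}_t\hat{\xi}_t'\right\}^{-1}\left\{(1/T)\sum_{t=1}^{T}\hat{\xi}_t\hat{\eta}_{it}\right\} = \hat{r}_i\cdot e_i,$$

where $e_i$ denotes the $i$th column of $I_n$. Thus, even if all $n$ elements of $\hat{\xi}_t$ appeared
in the regression, only the $i$th element $\hat{\xi}_{it}$ would have a nonzero coefficient in the
regression used to explain $\hat{\eta}_{it}$. The average squared residual for this regression
would be

$$\left\{(1/T)\sum_{t=1}^{T}\hat{\eta}_{it}^2\right\} - \left\{(1/T)\sum_{t=1}^{T}\hat{\eta}_{it}\hat{\xi}_t'\right\}\left\{(1/T)\sum_{t=1}^{T}\hat{\xi}_t\hat{\xi}_t'\right\}^{-1}\left\{(1/T)\sum_{t=1}^{T}\hat{\xi}_t\hat{\eta}_{it}\right\}$$

$$= 1 - \hat{r}_i\cdot e_i'\cdot I_n\cdot e_i\cdot\hat{r}_i$$

$$= 1 - \hat{r}_i^2.$$

Moreover, conditions [20.2.23] through [20.2.25] imply that the residual for the $i$th
regression, $\hat{\eta}_{it} - \hat{r}_i\hat{\xi}_{it}$, would be orthogonal to the residual from the $j$th regression,
$\hat{\eta}_{jt} - \hat{r}_j\hat{\xi}_{jt}$, for $i \neq j$. Thus, if $\zeta_0$ were unrestricted, the optimal value for the matrix
in [20.2.27] would be a diagonal matrix with $(1 - \hat{r}_i^2)$ in the row $i$, column $i$ position
and zero elsewhere.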

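Now suppose that we are restricted to use only $h$ linear combinations of $\hat{\xi}_t$
as regressors. From the preceding analysis, we might guess that the best we can
do is use the $h$ elements of $\hat{\xi}_t$ that have the highest correlations with elements of
$\hat{\eta}_t$, that is, choose $(\hat{\xi}_{1t}, \hat{\xi}_{2t}, \ldots, \hat{\xi}_{ht})$ as regressors.³ When this set of regressors is
used to explain $\hat{\eta}_{it}$ for $i \leq h$, the average squared residual will be $(1 - \hat{r}_i^2)$, as
before. When this set of regressors is used to explain $\hat{\eta}_{it}$ for $i > h$, all of the
regressors are orthogonal to $\hat{\eta}_{it}$ and would receive regression coefficients of zero.
The average squared residual for the latter regression is simply $(1/T)\sum_{t=1}^{T}\hat{\eta}_{it}^2 = 1$
for $i = h+1, h+2, \ldots, n$. Thus, if we are restricted to using only $h$ linear
combinations of $\hat{\xi}_t$, the optimized value of [20.2.27] will be

$$\left|(1/T)\sum_{t=1}^{T}\left[(\hat{\eta}_t - \zeta_0^*\hat{\xi}_t)(\hat{\eta}_t - \zeta_0^*\hat{\xi}_t)'\right]\right| = \begin{vmatrix} 1-\hat{r}_1^2 & 0 & \cdots & 0 & 0 & \cdots & 0 \\ 0 & 1-\hat{r}_2^2 & \cdots & 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1-\hat{r}_h^2 & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 & 0 & \cdots & 1 \end{vmatrix} = \prod_{i=1}^{h}(1 - \hat{r}_i^2). \tag*{[20.2.28]}$$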
³See Johansen (1988) for a more formal demonstration of this claim.
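The value in [20.2.28] can be reproduced on artificial data that are exactly in canonical form. The sketch below is only an illustration: the canonical correlations in `r`, the sample size, and the construction of the data through a QR factorization are all assumptions made for the example. It builds $\hat{\eta}_t$ and $\hat{\xi}_t$ satisfying [20.2.23] through [20.2.25], regresses $\hat{\eta}_t$ on the first $h$ elements of $\hat{\xi}_t$, and compares the determinant of the residual moment matrix with $\prod_{i=1}^{h}(1 - \hat{r}_i^2)$.

```python
import numpy as np

rng = np.random.default_rng(2)
T, n, h = 400, 3, 2
r = np.array([0.9, 0.5, 0.1])      # hypothetical canonical correlations

# Columns of Z have sample second-moment matrix exactly equal to I_{2n}.
Q, _ = np.linalg.qr(rng.standard_normal((T, 2 * n)))
Z = np.sqrt(T) * Q
xi, noise = Z[:, :n], Z[:, n:]
eta = xi * r + noise * np.sqrt(1.0 - r ** 2)   # satisfies [20.2.23]-[20.2.25]

# Regress every element of eta on the first h elements of xi.
X = xi[:, :h]
coef, *_ = np.linalg.lstsq(X, eta, rcond=None)
resid = eta - X @ coef
det_resid = np.linalg.det(resid.T @ resid / T)
target = np.prod(1.0 - r[:h] ** 2)
print(det_resid, target)
```

With these inputs both numbers equal $(1 - 0.81)(1 - 0.25) = 0.1425$ up to rounding error.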




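Of course, the actual data $\hat{u}_t$ and $\hat{v}_t$ will not be in exact canonical form.
However, the previous section described how to find $(n \times n)$ matrices $\hat{\mathcal{K}}$ and $\hat{\mathcal{A}}$
such that

$$\hat{\eta}_t = \hat{\mathcal{K}}'\hat{u}_t \tag*{[20.2.29]}$$

$$\hat{\xi}_t = \hat{\mathcal{A}}'\hat{v}_t. \tag*{[20.2.30]}$$

The columns of $\hat{\mathcal{A}}$ are given by the eigenvectors of the matrix in [20.2.9], normalized
by the condition $\hat{\mathcal{A}}'\hat{\Sigma}_{VV}\hat{\mathcal{A}} = I_n$. The eigenvalues of [20.2.9] give the squares of
the canonical correlations:

$$\hat{\lambda}_i = \hat{r}_i^2. \tag*{[20.2.31]}$$

The columns of $\hat{\mathcal{K}}$ correspond to the normalized eigenvectors of the matrix
$\hat{\Sigma}_{UU}^{-1}\hat{\Sigma}_{UV}\hat{\Sigma}_{VV}^{-1}\hat{\Sigma}_{VU}$, though it turns out that $\hat{\mathcal{K}}$ does not actually have to be calculated
in order to use the following results. Assuming that $\hat{\mathcal{K}}$ and $\hat{\mathcal{A}}$ are nonsingular,
[20.2.29] and [20.2.30] allow [20.2.22] to be written

$$\left|(1/T)\sum_{t=1}^{T}\left[(\hat{u}_t - \zeta_0\hat{v}_t)(\hat{u}_t - \zeta_0\hat{v}_t)'\right]\right|$$

$$= \left|(1/T)\sum_{t=1}^{T}\left[\left((\hat{\mathcal{K}}')^{-1}\hat{\eta}_t - \zeta_0(\hat{\mathcal{A}}')^{-1}\hat{\xi}_t\right)\left((\hat{\mathcal{K}}')^{-1}\hat{\eta}_t - \zeta_0(\hat{\mathcal{A}}')^{-1}\hat{\xi}_t\right)'\right]\right|$$

$$= \left|(\hat{\mathcal{K}}')^{-1}\left\{(1/T)\sum_{t=1}^{T}\left[\left(\hat{\eta}_t - \hat{\mathcal{K}}'\zeta_0(\hat{\mathcal{A}}')^{-1}\hat{\xi}_t\right)\left(\hat{\eta}_t - \hat{\mathcal{K}}'\zeta_0(\hat{\mathcal{A}}')^{-1}\hat{\xi}_t\right)'\right]\right\}\hat{\mathcal{K}}^{-1}\right|$$

$$= \left|(\hat{\mathcal{K}}')^{-1}\right|\cdot\left|(1/T)\sum_{t=1}^{T}\left[(\hat{\eta}_t - \Pi\hat{\xi}_t)(\hat{\eta}_t - \Pi\hat{\xi}_t)'\right]\right|\cdot\left|\hat{\mathcal{K}}^{-1}\right|$$

$$= \left|(1/T)\sum_{t=1}^{T}\left[(\hat{\eta}_t - \Pi\hat{\xi}_t)(\hat{\eta}_t - \Pi\hat{\xi}_t)'\right]\right| \div |\hat{\mathcal{K}}|^2, \tag*{[20.2.32]}$$

where

$$\Pi \equiv \hat{\mathcal{K}}'\zeta_0(\hat{\mathcal{A}}')^{-1}. \tag*{[20.2.33]}$$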


Recall that maximizing the concentrated log likelihood function for the actual
data [20.2.21] is equivalent to choosing $\zeta_0$ so as to minimize the expression in
[20.2.32] subject to the requirement that $\zeta_0$ can be written as $BA'$ for some $(n \times h)$
matrices $B$ and $A$. But $\zeta_0$ can be written in this form if and only if $\Pi$ in [20.2.33]
can be written in the form $\beta\gamma'$ for some $(n \times h)$ matrices $\beta$ and $\gamma$. Thus, the task
can be described as choosing $\Pi$ so as to minimize [20.2.32] subject to this condition.
But this is precisely the problem solved in [20.2.28]: the solution is to use as
regressors the first $h$ elements of $\hat{\xi}_t$. The value of [20.2.32] at the optimum is given
by

$$\prod_{i=1}^{h}(1 - \hat{r}_i^2) \div |\hat{\mathcal{K}}|^2. \tag*{[20.2.34]}$$

Moreover, the matrix $\hat{\mathcal{K}}$ satisfies

$$I_n = (1/T)\sum_{t=1}^{T}\hat{\eta}_t\hat{\eta}_t' = (1/T)\sum_{t=1}^{T}\hat{\mathcal{K}}'\hat{u}_t\hat{u}_t'\hat{\mathcal{K}} = \hat{\mathcal{K}}'\hat{\Sigma}_{UU}\hat{\mathcal{K}}. \tag*{[20.2.35]}$$

Taking determinants of both sides of [20.2.35] establishes

$$1 = |\hat{\mathcal{K}}'|\cdot|\hat{\Sigma}_{UU}|\cdot|\hat{\mathcal{K}}|$$

or

$$1/|\hat{\mathcal{K}}|^2 = |\hat{\Sigma}_{UU}|.$$

Substituting this back into [20.2.34], it appears that the optimized value of [20.2.32]
is equal to

$$|\hat{\Sigma}_{UU}| \times \prod_{i=1}^{h}(1 - \hat{r}_i^2).$$

Comparing [20.2.32] with [20.2.21], it follows that the maximum value achieved
for the log likelihood function is given by

$$\mathcal{L}^* = N(\hat{\zeta}_0) = -(Tn/2)\log(2\pi) - (Tn/2) - (T/2)\log\left\{|\hat{\Sigma}_{UU}| \times \prod_{i=1}^{h}(1 - \hat{r}_i^2)\right\},$$

as claimed in [20.2.10].
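A short numerical sketch of this result follows, using simulated residuals rather than the data of the text; the names `johansen_eigenvalues` and `max_loglik` are hypothetical. It computes the eigenvalues of the matrix in [20.2.9], evaluates the maximized log likelihood for each $h$, and checks the special case $h = n$, where $|\hat{\Sigma}_{UU}| \prod_{i=1}^{n}(1 - \hat{\lambda}_i)$ must coincide with the determinant of the unrestricted OLS residual covariance $\hat{\Sigma}_{UU} - \hat{\Sigma}_{UV}\hat{\Sigma}_{VV}^{-1}\hat{\Sigma}_{VU}$.

```python
import numpy as np

def johansen_eigenvalues(S_uu, S_vv, S_uv):
    """Eigenvalues of S_vv^{-1} S_vu S_uu^{-1} S_uv ([20.2.9]), largest first."""
    M = np.linalg.solve(S_vv, S_uv.T) @ np.linalg.solve(S_uu, S_uv)
    return np.sort(np.linalg.eigvals(M).real)[::-1]

def max_loglik(S_uu, lam, T, h):
    """Maximized log likelihood [20.2.10] under h cointegrating relations."""
    n = S_uu.shape[0]
    _, logdet = np.linalg.slogdet(S_uu)
    return (-(T * n / 2) * np.log(2 * np.pi) - T * n / 2
            - (T / 2) * logdet - (T / 2) * np.sum(np.log(1.0 - lam[:h])))

# Hypothetical residuals standing in for u_hat and v_hat.
rng = np.random.default_rng(3)
T, n = 200, 3
v = rng.standard_normal((T, n))
u = 0.2 * (v @ rng.standard_normal((n, n))) + rng.standard_normal((T, n))
S_uu, S_vv, S_uv = u.T @ u / T, v.T @ v / T, u.T @ v / T

lam = johansen_eigenvalues(S_uu, S_vv, S_uv)
logliks = [max_loglik(S_uu, lam, T, h) for h in range(n + 1)]

# Special case h = n: compare with the OLS residual covariance determinant.
lhs = np.linalg.det(S_uu) * np.prod(1.0 - lam)
rhs = np.linalg.det(S_uu - S_uv @ np.linalg.solve(S_vv, S_uv.T))
print(lam, lhs, rhs)
```

The list `logliks` is nondecreasing in $h$, since each additional cointegrating relation relaxes the rank constraint on $\zeta_0$.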


Motivation for Maximum Likelihood Estimates 
of Parameters 


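We have seen that the concentrated log likelihood function [20.2.21] is maximized
by selecting as regressors the first $h$ elements of $\hat{\xi}_t$. Since $\hat{\xi}_t = \hat{\mathcal{A}}'\hat{v}_t$,
this means using $\hat{A}'\hat{v}_t$ as regressors, where the $(n \times h)$ matrix $\hat{A}$ denotes the first
$h$ columns of the $(n \times n)$ matrix $\hat{\mathcal{A}}$. Thus,

$$\zeta_0\hat{v}_t = -B\hat{A}'\hat{v}_t \tag*{[20.2.36]}$$

for some $(n \times h)$ matrix $B$. This verifies the claim that $\hat{A}$ is the maximum likelihood
estimate of a basis for the space of cointegrating vectors.

Given that we want to choose $\hat{w}_t \equiv \hat{A}'\hat{v}_t$ as regressors, the value of $B$ for
which the concentrated likelihood function will be maximized is obtained from
OLS regressions of $\hat{u}_t$ on $\hat{w}_t$:

$$-\hat{B} = \left[(1/T)\sum_{t=1}^{T}\hat{u}_t\hat{w}_t'\right]\left[(1/T)\sum_{t=1}^{T}\hat{w}_t\hat{w}_t'\right]^{-1}. \tag*{[20.2.37]}$$

But $\hat{w}_t$ is composed of $h$ canonical variates, meaning that

$$\left[(1/T)\sum_{t=1}^{T}\hat{w}_t\hat{w}_t'\right] = I_h. \tag*{[20.2.38]}$$

Moreover,

$$\left[(1/T)\sum_{t=1}^{T}\hat{u}_t\hat{w}_t'\right] = \left[(1/T)\sum_{t=1}^{T}\hat{u}_t\hat{v}_t'\hat{A}\right] = \hat{\Sigma}_{UV}\hat{A}. \tag*{[20.2.39]}$$

Substituting [20.2.39] and [20.2.38] into [20.2.37],

$$\hat{B} = -\hat{\Sigma}_{UV}\hat{A},$$

and so, from [20.2.2], the maximum likelihood estimate of $\zeta_0$ is given by

$$\hat{\zeta}_0 = \hat{\Sigma}_{UV}\hat{A}\hat{A}',$$

as claimed in [20.2.12].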
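The estimates $\hat{A}$, $\hat{B}$, and $\hat{\zeta}_0$ can be computed in a few lines. The sketch below is one possible implementation, not the only one: it uses a Cholesky factor of $\hat{\Sigma}_{VV}$ to turn [20.2.9] into a symmetric eigenvalue problem, which delivers eigenvectors already normalized so that $\hat{A}'\hat{\Sigma}_{VV}\hat{A} = I_h$. The function name `johansen_basis` and the simulated residuals are assumptions of the example.

```python
import numpy as np

def johansen_basis(S_uu, S_vv, S_uv, h):
    """Return eigenvalues (largest first), A_hat with A_hat' S_vv A_hat = I_h,
    and zeta0_hat = S_uv A_hat A_hat' as in [20.2.12]."""
    L = np.linalg.cholesky(S_vv)                      # S_vv = L L'
    C = S_uv.T @ np.linalg.solve(S_uu, S_uv)          # S_vu S_uu^{-1} S_uv
    M = np.linalg.solve(L, np.linalg.solve(L, C).T)   # L^{-1} C L^{-T}
    lam, W = np.linalg.eigh((M + M.T) / 2)            # ascending order
    order = np.argsort(lam)[::-1]
    lam, W = lam[order], W[:, order]
    A_full = np.linalg.solve(L.T, W)                  # columns a_i = L^{-T} w_i
    A_hat = A_full[:, :h]
    zeta0_hat = S_uv @ A_hat @ A_hat.T
    return lam, A_hat, zeta0_hat

# Hypothetical residuals standing in for u_hat and v_hat.
rng = np.random.default_rng(5)
T, n, h = 300, 3, 1
v = rng.standard_normal((T, n))
u = (0.3 * v) @ rng.standard_normal((n, n)) + rng.standard_normal((T, n))
S_uu, S_vv, S_uv = u.T @ u / T, v.T @ v / T, u.T @ v / T

lam, A_hat, zeta0_hat = johansen_basis(S_uu, S_vv, S_uv, h)
B_hat = -S_uv @ A_hat                                 # so zeta0_hat = -B_hat A_hat'
print(np.linalg.matrix_rank(zeta0_hat))               # equals h
```

When $h = n$ the same function returns $\hat{\zeta}_0 = \hat{\Sigma}_{UV}\hat{\Sigma}_{VV}^{-1}$, the unrestricted OLS coefficient, because $\hat{\mathcal{A}}\hat{\mathcal{A}}' = \hat{\Sigma}_{VV}^{-1}$ under the normalization used.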




Expressions [20.2.17] and [20.2.18] gave the values of $\alpha$ and $\zeta_i$ that maximized
the likelihood function for any given value of $\zeta_0$. Since the likelihood function is
maximized with respect to $\zeta_0$ by choosing $\hat{\zeta}_0$ according to [20.2.12], it is maximized
with respect to $\alpha$ and $\zeta_i$ by substituting $\hat{\zeta}_0$ into [20.2.17] and [20.2.18], as claimed
in [20.2.14] and [20.2.13]. Finally, substituting $\hat{\zeta}_0$ into [20.2.20] verifies [20.2.15].


Maximum Likelihood Estimation in the Absence 
of Deterministic Time Trends 


The preceding analysis assumed that $\alpha$, the $(n \times 1)$ vector of constant terms
in the VAR, was unrestricted. The value of $\alpha$ contributes $h$ constant terms for the
$h$ cointegrating relations, along with $g = n - h$ deterministic time trends that are
common to each of the $n$ elements of $y_t$. In some applications it might be of interest
to allow constant terms in the cointegrating relations but to rule out deterministic
time trends for any of the variables. We saw in equation [19.1.45] that this would
require

$$\alpha = B\mu_1^*, \tag*{[20.2.40]}$$

where $B$ is the $(n \times h)$ matrix appearing in [20.2.2] while $\mu_1^*$ is an $(h \times 1)$ vector
corresponding to the unconditional mean of $z_t = A'y_t$. Thus, for this restricted
case, we want to estimate only the $h$ elements of $\mu_1^*$ rather than all $n$ elements of
$\alpha$.

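To maximize the likelihood function subject to the restrictions that there are
$h$ cointegrating relations and no deterministic time trends in any of the series,
Johansen's (1991) first step was to concentrate out $\zeta_1, \zeta_2, \ldots,$ and $\zeta_{p-1}$ (but not
$\alpha$). For given $\alpha$ and $\zeta_0$, this is achieved by OLS regression of $(\Delta y_t - \alpha - \zeta_0 y_{t-1})$
on $(\Delta y_{t-1}, \Delta y_{t-2}, \ldots, \Delta y_{t-p+1})$. The residuals from this regression are related to
the residuals from three separate regressions:

(1) A regression of $\Delta y_t$ on $(\Delta y_{t-1}, \Delta y_{t-2}, \ldots, \Delta y_{t-p+1})$ with no constant term,

$$\Delta y_t = \hat{\Pi}_1\Delta y_{t-1} + \hat{\Pi}_2\Delta y_{t-2} + \cdots + \hat{\Pi}_{p-1}\Delta y_{t-p+1} + \hat{u}_t; \tag*{[20.2.41]}$$

(2) A regression of a constant term on $(\Delta y_{t-1}, \Delta y_{t-2}, \ldots, \Delta y_{t-p+1})$,

$$1 = \hat{\omega}_1'\Delta y_{t-1} + \hat{\omega}_2'\Delta y_{t-2} + \cdots + \hat{\omega}_{p-1}'\Delta y_{t-p+1} + \hat{w}_t; \tag*{[20.2.42]}$$

(3) A regression of $y_{t-1}$ on $(\Delta y_{t-1}, \Delta y_{t-2}, \ldots, \Delta y_{t-p+1})$ with no constant term,

$$y_{t-1} = \hat{\aleph}_1\Delta y_{t-1} + \hat{\aleph}_2\Delta y_{t-2} + \cdots + \hat{\aleph}_{p-1}\Delta y_{t-p+1} + \hat{v}_t. \tag*{[20.2.43]}$$

The concentrated log likelihood function is then

$$M(\Omega, \alpha, \zeta_0) = -(Tn/2)\log(2\pi) - (T/2)\log|\Omega| - (1/2)\sum_{t=1}^{T}\left[(\hat{u}_t - \alpha\hat{w}_t - \zeta_0\hat{v}_t)'\Omega^{-1}(\hat{u}_t - \alpha\hat{w}_t - \zeta_0\hat{v}_t)\right].$$

Further concentrating out $\Omega$ results in

$$N(\alpha, \zeta_0) = -(Tn/2)\log(2\pi) - (Tn/2) - (T/2)\log\left|(1/T)\sum_{t=1}^{T}(\hat{u}_t - \alpha\hat{w}_t - \zeta_0\hat{v}_t)(\hat{u}_t - \alpha\hat{w}_t - \zeta_0\hat{v}_t)'\right|. \tag*{[20.2.44]}$$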

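Imposing the constraints $\alpha = B\mu_1^*$ and $\zeta_0 = -BA'$, the magnitude in [20.2.44]
can be written

$$N(\alpha, \zeta_0) = -(Tn/2)\log(2\pi) - (Tn/2) - (T/2)\log\left|\sum_{t=1}^{T}(1/T)\left\{(\hat{u}_t + B\tilde{A}'\tilde{w}_t)(\hat{u}_t + B\tilde{A}'\tilde{w}_t)'\right\}\right|, \tag*{[20.2.45]}$$

where

$$\underset{(n+1)\times 1}{\tilde{w}_t} \equiv \begin{bmatrix} \hat{w}_t \\ \hat{v}_t \end{bmatrix}$$

$$\underset{h\times(n+1)}{\tilde{A}'} \equiv \begin{bmatrix} -\mu_1^* & A' \end{bmatrix}. \tag*{[20.2.46]}$$

But setting $\zeta_0 = -BA'$ in [20.2.21] produces an expression of exactly the same
form as [20.2.45], with $A$ in [20.2.21] replaced by $\tilde{A}$ and $\hat{v}_t$ replaced by $\tilde{w}_t$. Thus,
the restricted log likelihood is maximized simply by replacing $\hat{v}_t$ in the analysis of
[20.2.21] with $\tilde{w}_t$.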

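To summarize, construct

$$\hat{\Sigma}_{\tilde{W}\tilde{W}} = (1/T)\sum_{t=1}^{T}\tilde{w}_t\tilde{w}_t'$$

$$\hat{\Sigma}_{\tilde{W}U} = (1/T)\sum_{t=1}^{T}\tilde{w}_t\hat{u}_t' = \hat{\Sigma}_{U\tilde{W}}'$$

and find the eigenvalues of the $(n+1) \times (n+1)$ matrix

$$\hat{\Sigma}_{\tilde{W}\tilde{W}}^{-1}\hat{\Sigma}_{\tilde{W}U}\hat{\Sigma}_{UU}^{-1}\hat{\Sigma}_{U\tilde{W}} \tag*{[20.2.47]}$$

ordered $\tilde{\lambda}_1 > \tilde{\lambda}_2 > \cdots > \tilde{\lambda}_{n+1}$. The maximum value achieved for the log likelihood
function subject to the constraint that there are $h$ cointegrating relations and no
deterministic time trends is

$$\tilde{\mathcal{L}}_0^* = -(Tn/2)\log(2\pi) - (Tn/2) - (T/2)\log|\hat{\Sigma}_{UU}| - (T/2)\sum_{i=1}^{h}\log(1 - \tilde{\lambda}_i). \tag*{[20.2.48]}$$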


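Let $\tilde{a}_1, \tilde{a}_2, \ldots, \tilde{a}_{n+1}$ denote the eigenvectors of [20.2.47] normalized by
$\tilde{a}_i'\hat{\Sigma}_{\tilde{W}\tilde{W}}\tilde{a}_i = 1$. Then the maximum likelihood estimate of $\tilde{A}$ is given by the matrix
$[\tilde{a}_1\ \tilde{a}_2\ \cdots\ \tilde{a}_h]$. The maximum likelihood estimate of $B\tilde{A}'$ is

$$\widehat{B\tilde{A}'} = -\hat{\Sigma}_{U\tilde{W}}\hat{\tilde{A}}\hat{\tilde{A}}'. \tag*{[20.2.49]}$$

Recall from [20.2.46] that

$$B\tilde{A}' = \begin{bmatrix} -B\mu_1^* & BA' \end{bmatrix} = \begin{bmatrix} -\alpha & -\zeta_0 \end{bmatrix}. \tag*{[20.2.50]}$$

Thus, [20.2.49] implies that the maximum likelihood estimates of $\alpha$ and $\zeta_0$ are
given by

$$\begin{bmatrix} \hat{\alpha} & \hat{\zeta}_0 \end{bmatrix} = \hat{\Sigma}_{U\tilde{W}}\hat{\tilde{A}}\hat{\tilde{A}}'.$$

The MLE of $\zeta_i$ is

$$\hat{\zeta}_i = \hat{\Pi}_i - \hat{\alpha}\hat{\omega}_i' - \hat{\zeta}_0\hat{\aleph}_i \quad \text{for } i = 1, 2, \ldots, p-1,$$

while the MLE of $\Omega$ is

$$\hat{\Omega} = (1/T)\sum_{t=1}^{T}\left[(\hat{u}_t - \hat{\alpha}\hat{w}_t - \hat{\zeta}_0\hat{v}_t)(\hat{u}_t - \hat{\alpha}\hat{w}_t - \hat{\zeta}_0\hat{v}_t)'\right].$$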
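The restricted procedure differs from the earlier one only in the auxiliary regressions and in the stacking of $\hat{w}_t$ on top of $\hat{v}_t$. The sketch below illustrates these steps on simulated random walks (an assumption; no data from the text are used), with `residuals_no_constant` a hypothetical helper. Because $\hat{u}_t$ has only $n$ elements, the $(n+1) \times (n+1)$ matrix in [20.2.47] has rank at most $n$, so its smallest eigenvalue is zero up to rounding error.

```python
import numpy as np

def residuals_no_constant(lhs, X):
    """OLS residuals from regressing lhs on X with no constant term."""
    coef, *_ = np.linalg.lstsq(X, lhs, rcond=None)
    return lhs - X @ coef

# Hypothetical data: two independent random walks, VAR order p = 2.
rng = np.random.default_rng(4)
N, n, p = 250, 2, 2
y = np.cumsum(rng.standard_normal((N, n)), axis=0)
dy = np.diff(y, axis=0)
T = N - p
X = np.hstack([dy[p - 1 - j : N - 1 - j] for j in range(1, p)])  # lagged differences only

u_hat = residuals_no_constant(dy[p - 1:], X)           # [20.2.41]
w_hat = residuals_no_constant(np.ones((T, 1)), X)      # [20.2.42]
v_hat = residuals_no_constant(y[p - 1 : N - 1], X)     # [20.2.43]
w_tilde = np.hstack([w_hat, v_hat])                    # rows are w~_t'

S_uu = u_hat.T @ u_hat / T
S_ww = w_tilde.T @ w_tilde / T
S_uw = u_hat.T @ w_tilde / T

# Eigenvalues of the (n+1) x (n+1) matrix in [20.2.47], largest first.
M = np.linalg.solve(S_ww, S_uw.T) @ np.linalg.solve(S_uu, S_uw)
lam = np.sort(np.linalg.eigvals(M).real)[::-1]
print(lam)
```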


20.3. Hypothesis Testing 


We saw in the previous chapter that tests of the null hypothesis of no cointegration 
typically involve nonstandard asymptotic distributions, while tests about the value 
of the cointegrating vector under the maintained hypothesis that cointegration is 
present will have asymptotic $\chi^2$ distributions, provided that suitable allowance is
made for the serial correlation in the data. These results generalize to FIML
analysis. The asymptotic distribution of a test of the number of cointegrating
relations is nonstandard, but tests about the cointegrating vector are often $\chi^2$.


Testing the Null Hypothesis of h Cointegrating Relations 


Suppose that an $(n \times 1)$ vector $y_t$ can be characterized by a VAR($p$) in levels,
which we write in the form of [20.2.1]:

$$\Delta y_t = \zeta_1\Delta y_{t-1} + \zeta_2\Delta y_{t-2} + \cdots + \zeta_{p-1}\Delta y_{t-p+1} + \alpha + \zeta_0 y_{t-1} + \varepsilon_t. \tag*{[20.3.1]}$$

Under the null hypothesis $H_0$ that there are exactly $h$ cointegrating relations among
the elements of $y_t$, this VAR is restricted by the requirement that $\zeta_0$ can be written
in the form $\zeta_0 = -BA'$, for $B$ an $(n \times h)$ matrix and $A'$ an $(h \times n)$ matrix.
Another way of describing this restriction is that only $h$ linear combinations of the
levels of $y_{t-1}$ can be used in the regressions in [20.3.1]. The largest value that can
be achieved for the log likelihood function under this constraint was given by
[20.2.10]:

$$\mathcal{L}_0^* = -(Tn/2)\log(2\pi) - (Tn/2) - (T/2)\log|\hat{\Sigma}_{UU}| - (T/2)\sum_{i=1}^{h}\log(1 - \hat{\lambda}_i). \tag*{[20.3.2]}$$


Consider the alternative hypothesis $H_A$ that there are $n$ cointegrating relations,
where $n$ is the number of elements of $y_t$. This amounts to the claim that
every linear combination of $y_t$ is stationary, in which case $y_{t-1}$ would appear in
[20.3.1] without constraints and no restrictions are imposed on $\zeta_0$. The value for
the log likelihood function in the absence of constraints is given by

$$\mathcal{L}_A^* = -(Tn/2)\log(2\pi) - (Tn/2) - (T/2)\log|\hat{\Sigma}_{UU}| - (T/2)\sum_{i=1}^{n}\log(1 - \hat{\lambda}_i). \tag*{[20.3.3]}$$


A likelihood ratio test of $H_0$ against $H_A$ can be based on

$$\mathcal{L}_A^* - \mathcal{L}_0^* = -(T/2)\sum_{i=h+1}^{n}\log(1 - \hat{\lambda}_i).$$

If the hypothesis involved just $I(0)$ variables, we would expect twice the log
likelihood ratio,

$$2(\mathcal{L}_A^* - \mathcal{L}_0^*) = -T\sum_{i=h+1}^{n}\log(1 - \hat{\lambda}_i), \tag*{[20.3.4]}$$

to be asymptotically distributed as $\chi^2$. In the case of $H_0$, however, the hypothesis
involves the coefficient on $y_{t-1}$, which, from the Stock-Watson common trends
representation, depends on the value of $g \equiv (n - h)$ separate random walks. Let
$W(r)$ be $g$-dimensional standard Brownian motion. Suppose that the true value of
the constant term $\alpha$ in [20.3.1] is zero, meaning that there is no intercept in any
of the cointegrating relations and no deterministic time trend in any of the elements
of $y_t$. Suppose further that no constant term is included in the auxiliary regressions
[20.2.4] and [20.2.5] that were used to construct $\hat{u}_t$ and $\hat{v}_t$. Johansen (1988) showed
that under these conditions the asymptotic distribution of the statistic in [20.3.4]
is the same as that of the trace of the following matrix:

$$Q = \left[\int_0^1 W(r)\,dW(r)'\right]'\left[\int_0^1 W(r)W(r)'\,dr\right]^{-1}\left[\int_0^1 W(r)\,dW(r)'\right]. \tag*{[20.3.5]}$$


Percentiles for the trace of the matrix in [20.3.5] are reported in the case 1 portion 
of Table B.10. These are based on Monte Carlo simulations. 
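A Monte Carlo calculation of this kind can be sketched as follows. This is only an illustration of the idea and not the procedure used to build Table B.10: the discretization (500 steps), the number of replications, and the function name `simulate_trace_Q` are all choices made for the example. Each replication approximates $W(r)$ by a scaled $g$-dimensional random walk and replaces the integrals in [20.3.5] with sums, so the scale factors in $T$ cancel.

```python
import numpy as np

def simulate_trace_Q(g, steps=500, reps=2000, seed=0):
    """Draws approximating the trace of Q in [20.3.5] for g random walks."""
    rng = np.random.default_rng(seed)
    out = np.empty(reps)
    for k in range(reps):
        e = rng.standard_normal((steps, g))             # increments
        x = np.cumsum(e, axis=0)                        # random walk
        x_lag = np.vstack([np.zeros((1, g)), x[:-1]])   # walk lagged one step
        B = x_lag.T @ e            # steps   * approximation to int W dW'
        S = x_lag.T @ x_lag        # steps^2 * approximation to int W W' dr
        out[k] = np.trace(B.T @ np.linalg.solve(S, B))  # scale factors cancel
    return out

draws = simulate_trace_Q(g=1, steps=200, reps=500)
print(np.quantile(draws, [0.90, 0.95, 0.99]))           # simulated upper percentiles
```

Simulated percentiles from a small run like this one are noisy and depend on the discretization, so they should be expected to agree with tabulated values only roughly.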

If the number of cointegrating relations ($h$) is 1 less than the number of
variables ($n$), then $g = 1$ and [20.3.5] describes the following scalar:

$$Q = \frac{\left\{\int_0^1 W(r)\,dW(r)\right\}^2}{\int_0^1 [W(r)]^2\,dr} = \frac{(1/2)^2\left\{[W(1)]^2 - 1\right\}^2}{\int_0^1 [W(r)]^2\,dr}, \tag*{[20.3.6]}$$

where the second equality follows from [18.1.15]. Expression [20.3.6] will be
recognized as the square of the statistic [17.4.12] that described the asymptotic
distribution of the Dickey-Fuller test based on the OLS $t$ statistic. For example, if
we are considering an autoregression involving a single variable ($n = 1$), the null
hypothesis of no cointegrating relations ($h = 0$) amounts to the claim that $\zeta_0 = 0$
in [20.3.1] or that $\Delta y_t$ follows an AR($p - 1$) process. Thus, Johansen's procedure
provides an alternative approach to testing for unit roots in univariate series, an
idea explored further in Exercise 20.4.

Another approach would be to test the null hypothesis of $h$ cointegrating
relations against the alternative of $h + 1$ cointegrating relations. Twice the log
likelihood ratio for this case is given by

$$2(\mathcal{L}_A^* - \mathcal{L}_0^*) = -T\log(1 - \hat{\lambda}_{h+1}). \tag*{[20.3.7]}$$

Again, under the assumption that the true value of $\alpha = 0$ and that no constant
term is included in [20.2.4] or [20.2.5], the asymptotic distribution of the statistic
in [20.3.7] is the same as that of the largest eigenvalue of the matrix $Q$ defined in
[20.3.5]. Monte Carlo estimates of this distribution are reported in the case 1 section
of Table B.11.

Note that if $g = 1$, then $n = h + 1$. In this case the statistics [20.3.4] and
[20.3.7] are identical. For this reason, the first row in Table B.10 is the same as
the first row of Table B.11.
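The two statistics are simple functions of the ordered eigenvalues. A minimal sketch, with hypothetical eigenvalues and sample size chosen only for illustration and with function names that are not from the text:

```python
import numpy as np

def trace_statistic(lam, T, h):
    """[20.3.4]: -T * sum over i = h+1, ..., n of log(1 - lam_i)."""
    return -T * np.sum(np.log(1.0 - np.asarray(lam)[h:]))

def max_eigenvalue_statistic(lam, T, h):
    """[20.3.7]: -T * log(1 - lam_{h+1})."""
    return -T * np.log(1.0 - np.asarray(lam)[h])

lam = [0.30, 0.12, 0.04]       # hypothetical eigenvalues, largest first (n = 3)
T = 100

for h in range(3):
    print(h, trace_statistic(lam, T, h), max_eigenvalue_statistic(lam, T, h))
```

For $h = n - 1 = 2$ the two printed values coincide, which is the case $g = 1$ described above; for smaller $h$ the trace statistic is the larger of the two because it adds further nonnegative terms.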

Typically, the cointegrating relations could include nonzero intercepts, in
which case we would want to include constants in the auxiliary regressions [20.2.4]
and [20.2.5]. As one might guess from the analysis in Chapter 18, the asymptotic
distribution in this case depends on whether or not any of the series exhibit
deterministic time trends. Suppose that the true value of $\alpha$ is such that there are no
deterministic trends in any of the series, so that the true $\alpha$ satisfies $\alpha = B\mu_1^*$ as
in [20.2.40]. Assuming that no restrictions are imposed on the constant term in the
estimation of the auxiliary regressions [20.2.4] and [20.2.5], then the asymptotic
distribution of [20.3.4] is given in the case 2 section of Table B.10, while the
asymptotic distribution of [20.3.7] is given in the case 2 panel of Table B.11. By
contrast, if any of the variables exhibit deterministic time trends (one or more
elements of $\alpha - B\mu_1^*$ are nonzero), then the asymptotic distribution of [20.3.4] is
that of the variable in the case 3 section of Table B.10, while the asymptotic
distribution of [20.3.7] is given in the case 3 section of Table B.11.

When $g = 1$ and $\alpha \neq B\mu_1^*$, the single random walk that is common to $y_t$ is
dominated by a deterministic time trend. In this situation, Johansen and Juselius
(1990, p. 180) noted that the case 3 analog of [20.3.6] has a $\chi^2(1)$ distribution, for
reasons similar to those noted by West (1988) and discussed in Chapter 18. The
modest differences between the first row of the case 3 part of Table B.10 or B.11
and the first row of Table B.2 are presumably due to sampling error implicit in
the Monte Carlo procedure used to generate the values in Tables B.10 and B.11.


Application to Exchange Rate Data 


Consider for illustration the monthly data for Italy and the United States
plotted in Figure 19.2. The systems of equations in [20.2.4] and [20.2.5] were
estimated by OLS for $y_t = (p_t, s_t, p_t^*)'$, where $p_t$ is 100 times the log of the U.S.
price level, $s_t$ is 100 times the log of the dollar-lira exchange rate, and $p_t^*$ is 100
times the log of the Italian price level. The regressions were estimated over $t$ =
1974:2 through 1989:10 (so that the number of observations used for estimation
was $T = 189$); $p = 12$ lags were assumed for the VAR in levels.

The sample variance-covariance matrices for the residuals $\hat{u}_t$ and $\hat{v}_t$ were
calculated from [20.2.6] through [20.2.8] to be

$$\hat{\Sigma}_{UU} = \begin{bmatrix} 0.0435114 & -0.0316283 & 0.0154297 \\ -0.0316283 & 4.68650 & 0.0319877 \\ 0.0154297 & 0.0319877 & 0.179927 \end{bmatrix}$$

$$\hat{\Sigma}_{VV} = \begin{bmatrix} 427.366 & -370.699 & 805.812 \\ -370.699 & 424.083 & -709.036 \\ 805.812 & -709.036 & 1525.45 \end{bmatrix}$$

$$\hat{\Sigma}_{UV} = \begin{bmatrix} -0.484857 & 0.498758 & -0.837701 \\ -1.81401 & -2.95927 & -2.46896 \\ -1.80836 & 1.46897 & -3.58991 \end{bmatrix}.$$

The eigenvalues of the matrix in [20.2.9] are then⁴

$$\hat{\lambda}_1 = 0.1105 \qquad \hat{\lambda}_2 = 0.05603 \qquad \hat{\lambda}_3 = 0.03039$$

with

$$T\log(1 - \hat{\lambda}_1) = -22.12$$

$$T\log(1 - \hat{\lambda}_2) = -10.90$$

$$T\log(1 - \hat{\lambda}_3) = -5.83.$$

⁴Calculations were based on more significant digits than reported, and so the reader may find slight
discrepancies in trying to reproduce these results from the figures reported.




The likelihood ratio test of the null hypothesis of $h = 0$ cointegrating relations
against the alternative of $h = 3$ cointegrating relations is then calculated from
[20.3.4] to be

$$2(\mathcal{L}_A^* - \mathcal{L}_0^*) = 22.12 + 10.90 + 5.83 = 38.85. \tag*{[20.3.8]}$$

Here the number of unit roots under the null hypothesis is $g = n - h = 3$. Given
the evidence of deterministic time trends, the magnitude in [20.3.8] is to be
compared with the case 3 section of Table B.10. Since 38.85 > 29.5, the null hypothesis
of no cointegration is rejected at the 5% level. Similarly, the likelihood ratio test
[20.3.7] of the null hypothesis of no cointegrating relations ($h = 0$) against the
alternative of a single cointegrating relation ($h = 1$) is given by 22.12. Comparing
this with the case 3 section of Table B.11, we see that 22.12 > 20.8, so that the
null hypothesis of no cointegration is also rejected by this test.

This differs from the conclusion of the Phillips-Ouliaris test for no
cointegration between these series, on the basis of which the null hypothesis of no
cointegration for these variables was found to be accepted in Chapter 19.

Searching for evidence of a possible second cointegrating relation, consider
the likelihood ratio test of the null hypothesis of $h = 1$ cointegrating relation
against the alternative of $h = 3$ cointegrating relations:

$$2(\mathcal{L}_A^* - \mathcal{L}_0^*) = 10.90 + 5.83 = 16.73.$$

For this test, $g = 2$. Since 16.73 > 15.2, the null hypothesis of a single cointegrating
relation is rejected at the 5% level. The likelihood ratio test of the null hypothesis
of $h = 1$ cointegrating relation against the alternative of $h = 2$ relations is 10.90
< 14.0; hence, the two tests offer conflicting evidence as to the presence of a second
cointegrating relation.
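The sequence of comparisons above can be reproduced from the rounded eigenvalues. The sketch below uses only figures quoted in the text ($T = 189$, the three eigenvalues, and the four 5% critical values); the layout of the `tests` list is an arrangement made for the example. Because the eigenvalues are rounded, the computed statistics differ from the reported ones in the second decimal place.

```python
import math

T = 189
lam = [0.1105, 0.05603, 0.03039]                  # eigenvalues as reported (rounded)
terms = [-T * math.log(1.0 - l) for l in lam]     # roughly 22.1, 10.9, 5.8

tests = [
    # (hypotheses, statistic, 5% critical value quoted in the text)
    ("h=0 vs h=3 (trace, [20.3.4])",   sum(terms),     29.5),
    ("h=0 vs h=1 (max eig, [20.3.7])", terms[0],       20.8),
    ("h=1 vs h=3 (trace, [20.3.4])",   sum(terms[1:]), 15.2),
    ("h=1 vs h=2 (max eig, [20.3.7])", terms[1],       14.0),
]
decisions = []
for name, stat, crit in tests:
    reject = stat > crit
    decisions.append(reject)
    print(f"{name}: {stat:.2f} vs {crit} -> {'reject' if reject else 'do not reject'}")
```

The first three nulls are rejected and the last is not, which is the conflict between the trace and maximum-eigenvalue tests noted in the text.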

The eigenvector $\hat{a}_1$ of the matrix in [20.2.9] associated with $\hat{\lambda}_1$, normalized
so that $\hat{a}_1'\hat{\Sigma}_{VV}\hat{a}_1 = 1$, is given by

$$\hat{a}_1' = [-0.7579 \quad 0.02801 \quad 0.4220]. \tag*{[20.3.9]}$$

It is natural to renormalize this by taking the first element to be unity:

$$\tilde{a}_1' = [1.00 \quad -0.04 \quad -0.56].$$

This is virtually identical to the estimate of the cointegrating vector based on OLS
from [19.2.49].
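Both normalizations can be checked directly from the reported figures. This short sketch takes $\hat{\Sigma}_{VV}$ and $\hat{a}_1$ as printed above; since those figures are rounded, the quadratic form equals unity only approximately.

```python
import numpy as np

S_vv = np.array([[ 427.366, -370.699,  805.812],
                 [-370.699,  424.083, -709.036],
                 [ 805.812, -709.036, 1525.45 ]])
a1 = np.array([-0.7579, 0.02801, 0.4220])      # [20.3.9]

quad_form = a1 @ S_vv @ a1                     # close to 1 by the normalization
a1_renormalized = a1 / a1[0]                   # first element set to unity
print(round(quad_form, 3), np.round(a1_renormalized, 2))
```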


Likelihood Ratio Tests About the Cointegrating Vector 


Consider a system of $n$ variables that is assumed (under both the null and
the alternative) to be characterized by $h$ cointegrating relations. We might then
want to test a restriction on these cointegrating vectors, such as that only $q$ of the
variables are involved in the cointegrating relations. For example, we might be
interested in whether the middle coefficient in [20.3.9] is zero, that is, in whether
the cointegrating relation involves solely the U.S. and Italian price levels. For this
example $h = 1$, $q = 2$, and $n = 3$. In general it must be the case that $h \leq q \leq n$.
Since $h$ linear combinations of the $q$ variables included in the cointegrating relations
are stationary, if $q = h$, then all $q$ of the included variables would have to be
stationary in levels. If $q = n$, then the null hypothesis places no restrictions on the
cointegrating relations.




Consider the general restriction that there is a known $(q \times n)$ matrix $D'$ such
that the cointegrating relations involve only $D'y_t$. For the preceding example,

$$D' = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}. \tag*{[20.3.10]}$$

Hence, the error-correction term in [20.3.1] will take the form

$$\zeta_0 y_{t-1} = -BA'D'y_{t-1},$$

where $B$ is now an $(n \times h)$ matrix and $A'$ is an $(h \times q)$ matrix. Maximum likelihood
estimation proceeds exactly as in the previous section, where $\hat{v}_t$ in [20.2.5] is
replaced by the OLS residuals from regressions of $D'y_{t-1}$ on a constant and $\Delta y_{t-1}$,
$\Delta y_{t-2}, \ldots, \Delta y_{t-p+1}$. This is equivalent to replacing $\hat{\Sigma}_{VV}$ in [20.2.6] and $\hat{\Sigma}_{UV}$
in [20.2.8] with

$$\tilde{\Sigma}_{VV} = D'\hat{\Sigma}_{VV}D \tag*{[20.3.11]}$$

$$\tilde{\Sigma}_{UV} = \hat{\Sigma}_{UV}D. \tag*{[20.3.12]}$$

Let $\tilde{\lambda}_i$ denote the $i$th largest eigenvalue of

$$\tilde{\Sigma}_{VV}^{-1}\tilde{\Sigma}_{VU}\hat{\Sigma}_{UU}^{-1}\tilde{\Sigma}_{UV}. \tag*{[20.3.13]}$$

The maximized value for the restricted log likelihood is then

$$\mathcal{L}_0^* = -(Tn/2)\log(2\pi) - (Tn/2) - (T/2)\log|\hat{\Sigma}_{UU}| - (T/2)\sum_{i=1}^{h}\log(1 - \tilde{\lambda}_i).$$

A likelihood ratio test of the null hypothesis that the $h$ cointegrating relations only
involve $D'y_t$ against the alternative hypothesis that the $h$ cointegrating relations
could involve any elements of $y_t$ would then be

$$2(\mathcal{L}_A^* - \mathcal{L}_0^*) = -T\sum_{i=1}^{h}\log(1 - \hat{\lambda}_i) + T\sum_{i=1}^{h}\log(1 - \tilde{\lambda}_i). \tag*{[20.3.14]}$$

In this case, the null hypothesis involves only coefficients on $I(0)$ variables
(the error-correction terms $z_t = A'y_t$), and standard asymptotic distribution theory
turns out to apply. Johansen (1988, 1991) showed that the likelihood ratio statistic
[20.3.14] has an asymptotic $\chi^2$ distribution with $h\cdot(n - q)$ degrees of freedom.

For illustration, consider the restriction represented by [20.3.10] that the
exchange rate has a coefficient of zero in the cointegrating vector [20.3.9]. From
[20.3.11] and [20.3.12], we calculate

$$\tilde{\Sigma}_{VV} = \begin{bmatrix} 427.366 & 805.812 \\ 805.812 & 1525.45 \end{bmatrix}$$

$$\tilde{\Sigma}_{UV} = \begin{bmatrix} -0.484857 & -0.837701 \\ -1.81401 & -2.46896 \\ -1.80836 & -3.58991 \end{bmatrix}.$$

The eigenvalues for the matrix in [20.3.13] are then

$$\tilde{\lambda}_1 = 0.1059 \qquad \tilde{\lambda}_2 = 0.04681,$$

with

$$T\log(1 - \tilde{\lambda}_1) = -21.15 \qquad T\log(1 - \tilde{\lambda}_2) = -9.06.$$




The likelihood ratio statistic [20.3.14] is

$$2(\mathcal{L}_A^* - \mathcal{L}_0^*) = 22.12 - 21.15 = 0.97.$$

The degrees of freedom for this statistic are

$$h\cdot(n - q) = 1\cdot(3 - 2) = 1;$$

the null hypothesis imposes a single restriction on the cointegrating vector. The
5% critical value for a $\chi^2(1)$ variable is seen from Table B.2 to be 3.84. Since
0.97 < 3.84, the null hypothesis that the exchange rate does not appear in the
cointegrating relation is accepted. The restricted cointegrating vector (normalized
with the coefficient on the U.S. price level to be unity) is

$$\tilde{a}_1' = [1.00 \quad 0.00 \quad -0.54].$$
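The restricted moment matrices in this example follow mechanically from [20.3.11] and [20.3.12]. The sketch below forms them from the reported $\hat{\Sigma}_{VV}$ and $\hat{\Sigma}_{UV}$ and then repeats the test arithmetic using the reported values 22.12 and 21.15; the variable names are illustrative.

```python
import numpy as np

S_vv = np.array([[ 427.366, -370.699,  805.812],
                 [-370.699,  424.083, -709.036],
                 [ 805.812, -709.036, 1525.45 ]])
S_uv = np.array([[-0.484857,  0.498758, -0.837701],
                 [-1.81401,  -2.95927,  -2.46896 ],
                 [-1.80836,   1.46897,  -3.58991 ]])
D = np.array([[1, 0],
              [0, 0],
              [0, 1]])                 # D' is the (q x n) matrix in [20.3.10]

S_vv_r = D.T @ S_vv @ D                # [20.3.11]: drops the exchange-rate row and column
S_uv_r = S_uv @ D                      # [20.3.12]: drops the exchange-rate column

h, n, q = 1, 3, 2
lr_stat = 22.12 - 21.15                # [20.3.14], from the reported values
dof = h * (n - q)
crit_5pct = 3.84                       # chi-square(1) critical value quoted in the text
print(S_vv_r, S_uv_r, round(lr_stat, 2), dof, lr_stat > crit_5pct)
```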


As a second example, consider the hypothesis that originally suggested interest
in a possible cointegrating relation between these three variables. This was the
hypothesis that the real exchange rate is stationary, or that the cointegrating vector
is proportional to $(1, -1, -1)'$. For this hypothesis, $D' = (1, -1, -1)$ and

$$\tilde{\Sigma}_{VV} = 88.5977$$

$$\tilde{\Sigma}_{UV} = \begin{bmatrix} -0.145914 \\ 3.61422 \\ 0.312582 \end{bmatrix}.$$

In this case, the matrix [20.3.13] is the scalar 0.0424498, and so $\tilde{\lambda}_1 = 0.0424498$
and $T\log(1 - \tilde{\lambda}_1) = -8.20$. Thus, the likelihood ratio test of the null hypothesis
that the cointegrating vector is proportional to $(1, -1, -1)'$ is

$$2(\mathcal{L}_A^* - \mathcal{L}_0^*) = 22.12 - 8.20 = 13.92.$$

In this case, the degrees of freedom are

$$h\cdot(n - q) = 1\cdot(3 - 1) = 2.$$

The 5% critical value for a $\chi^2(2)$ variable is 5.99. Since 13.92 > 5.99, the null
hypothesis that the cointegrating vector is proportional to $(1, -1, -1)'$ is rejected.
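Because $q = 1$ here, the matrix in [20.3.13] is a scalar and the whole calculation can be replicated from the reported moment matrices. The sketch below does so; it relies on the printed (rounded) figures for $\hat{\Sigma}_{UU}$, $\hat{\Sigma}_{VV}$, and $\hat{\Sigma}_{UV}$, so the results match the text only to about three significant digits.

```python
import numpy as np

S_uu = np.array([[ 0.0435114, -0.0316283, 0.0154297],
                 [-0.0316283,  4.68650,   0.0319877],
                 [ 0.0154297,  0.0319877, 0.179927 ]])
S_vv = np.array([[ 427.366, -370.699,  805.812],
                 [-370.699,  424.083, -709.036],
                 [ 805.812, -709.036, 1525.45 ]])
S_uv = np.array([[-0.484857,  0.498758, -0.837701],
                 [-1.81401,  -2.95927,  -2.46896 ],
                 [-1.80836,   1.46897,  -3.58991 ]])
D = np.array([[1.0], [-1.0], [-1.0]])        # D' = (1, -1, -1)
T = 189

S_vv_r = D.T @ S_vv @ D                      # [20.3.11], a (1 x 1) matrix
S_uv_r = S_uv @ D                            # [20.3.12], a (3 x 1) vector
# Scalar version of [20.3.13].
lam_r = (S_uv_r.T @ np.linalg.solve(S_uu, S_uv_r) / S_vv_r).item()

lr_stat = 22.12 + T * np.log(1.0 - lam_r)    # [20.3.14] with the reported 22.12
print(S_vv_r.item(), lam_r, lr_stat, lr_stat > 5.99)
```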


Other Hypothesis Tests 


A number of other hypotheses can be tested in this framework. For example,
Johansen (1991) showed that the null hypothesis that there are no deterministic
time trends in any of the series can be tested by taking twice the difference between
[20.2.10] and [20.2.48]. Under the null hypothesis, this likelihood ratio statistic is
asymptotically $\chi^2$ with $g = n - h$ degrees of freedom. Johansen also discussed
construction of Wald-type tests of hypotheses involving the cointegrating vectors.

Not all hypothesis tests about the coefficients in Johansen's framework are
asymptotically $\chi^2$. Consider an error-correction VAR of the form of [20.2.1] where
$\zeta_0 = -BA'$. Suppose we are interested in the null hypothesis that the last $n_2$
elements of $y_t$ fail to Granger-cause the first $n_1$ elements of $y_t$. Toda and Phillips
(forthcoming) showed that a Wald test of this null hypothesis can have a nonstandard
distribution. See Mosconi and Giannini (1992) for further discussion.




Comparison Between FIML and Other Approaches 


Johansen’s FIML estimation represents the short-run dynamics of a system 
in terms of a vector autoregression in differences with the error-correction vector 
$z_{t-1}$ added. Short-run dynamics can also be modeled with what are sometimes
called nonparametric methods, such as the Bartlett window used to construct the 
fully modified Phillips-Hansen (1990) estimator in equation [19.3.53]. Related non- 
parametric estimators have been proposed by Phillips (1990, 1991a), Park (1992), 
and Park and Ogaki (1991). Park (1990) established the asymptotic equivalence of 
the parametric and nonparametric approaches, and Phillips (1991a) discussed the 
sense in which any FIML estimator is asymptotically efficient. Johansen (1992) 
provided a further discussion of the relation between limited-information and full- 
information estimation strategies. 

In practice, the parametric and nonparametric approaches differ not just in 
their treatment of short-run dynamics but also in the normalizations employed. 
The fact that Johansen’s method seeks to estimate the space of cointegrating re- 
lations rather than a particular set of coefficients can be both an asset and a liability. 
It is an asset if the researcher has no prior information about which variables appear 
in the cointegrating relations and is concerned about inadvertently normalizing 
$a_{11} = 1$ when the true value of $a_{11} = 0$. On the other hand, Phillips (1991b) has
stressed that if the researcher wants to make structural interpretations of the sep- 
arate cointegrating relations, this logically requires imposing further restrictions on 
the matrix A’. 

For example, let $r_t$ denote the nominal interest rate on 3-month corporate
debt, $i_t$ the nominal interest rate on 3-month government debt, and $\pi_t$ the 3-month
inflation rate. Suppose that these three variables appear to be $I(1)$ and exhibit two
cointegrating relations. A natural view is that these cointegrating relations represent
two stabilizing relations. The first reflects forces that keep the risk premium
stationary, so that

$$r_t = \mu_1^* + \gamma_1 i_t + z_{1t}^*, \tag*{[20.3.15]}$$

with $z_{1t}^* \sim I(0)$. A second force is the Fisher effect, which tends to keep the real
interest rate stationary:

$$\pi_t = \mu_2^* + \gamma_2 i_t + z_{2t}^*, \tag*{[20.3.16]}$$

with $z_{2t}^* \sim I(0)$. The system of [20.3.15] and [20.3.16] will be recognized as an
example of Phillips's (1991a) triangular representation [19.1.20] for the vector $y_t =
(r_t, \pi_t, i_t)'$. Thus, in this example theoretical considerations suggest a natural
ordering of variables for which the normalization used by Phillips would be of
particular interest for structural inference: the coefficients $\mu_1^*$ and $\gamma_1$ tell us about
the risk premium, and the coefficients $\mu_2^*$ and $\gamma_2$ tell us about the Fisher effect.


20.4. Overview of Unit Roots—To Difference 
or Not to Difference? 


The preceding chapters have explored a number of issues in the statistical analysis 
of unit roots. This section attempts to summarize what all this means in practice. 

Consider a vector of variables y, whose dynamics we would like to describe 
and some of whose elements may be nonstationary. For concreteness, let us assume 
that the goal is to characterize these dynamics in terms of a vector autoregression. 

One option is to ignore the nonstationarity altogether and simply estimate
the VAR in levels, relying on standard $t$ and $F$ distributions for testing any
hypotheses. This strategy has the following features to recommend it. (1) The
parameters that describe the system's dynamics are estimated consistently. (2) Even
if the true model is a VAR in differences, certain functions of the parameters and
hypothesis tests based on a VAR in levels have the same asymptotic distribution
as would estimates based on differenced data. (3) A Bayesian motivation can be
given for the usual $t$ or $F$ distributions for test statistics even when the classical
asymptotic theory for these statistics is nonstandard.

A second option is routinely to difference any apparently nonstationary
variables before estimating the VAR. If the true process is a VAR in differences, then
differencing should improve the small-sample performance of all of the estimates
and eliminate altogether the nonstandard asymptotic distributions associated with
certain hypothesis tests. The drawback to this approach is that the true process
may not be a VAR in differences. Some of the series may in fact have been
stationary, or perhaps some linear combinations of the series are stationary, as in
a cointegrated VAR. In such circumstances a VAR in differenced form is
misspecified.

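Yet a third approach is to investigate carefully the nature of the
nonstationarity, testing each series individually for unit roots and then testing for possible
cointegration among the series. Once the nature of the nonstationarity is
understood, a stationary representation for the system can be estimated. For example,
suppose that in a four-variable system we determine that the first variable $y_{1t}$ is
stationary while the other variables ($y_{2t}$, $y_{3t}$, and $y_{4t}$) are each individually $I(1)$.
Suppose we further conclude that $y_{2t}$, $y_{3t}$, and $y_{4t}$ are characterized by a single
cointegrating relation. For $\mathbf{y}_{2t} \equiv (y_{2t}, y_{3t}, y_{4t})'$, this implies a vector error-correction
representation of the form

$$\begin{bmatrix} y_{1t} \\ \Delta\mathbf{y}_{2t} \end{bmatrix} = \begin{bmatrix} \alpha_1^* \\ \alpha_2^* \end{bmatrix} + \begin{bmatrix} \zeta_{11}^{(1)} & \zeta_{12}^{(1)} \\ \zeta_{21}^{(1)} & \zeta_{22}^{(1)} \end{bmatrix}\begin{bmatrix} y_{1,t-1} \\ \Delta\mathbf{y}_{2,t-1} \end{bmatrix} + \begin{bmatrix} \zeta_{11}^{(2)} & \zeta_{12}^{(2)} \\ \zeta_{21}^{(2)} & \zeta_{22}^{(2)} \end{bmatrix}\begin{bmatrix} y_{1,t-2} \\ \Delta\mathbf{y}_{2,t-2} \end{bmatrix} + \cdots$$

$$+ \begin{bmatrix} \zeta_{11}^{(p-1)} & \zeta_{12}^{(p-1)} \\ \zeta_{21}^{(p-1)} & \zeta_{22}^{(p-1)} \end{bmatrix}\begin{bmatrix} y_{1,t-p+1} \\ \Delta\mathbf{y}_{2,t-p+1} \end{bmatrix} + \begin{bmatrix} \zeta_1^{(0)} \\ \zeta_2^{(0)} \end{bmatrix}\mathbf{y}_{2,t-1} + \begin{bmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{bmatrix},$$

where the $(4 \times 3)$ matrix $\begin{bmatrix} \zeta_1^{(0)} \\ \zeta_2^{(0)} \end{bmatrix}$ is restricted to be of the form $ba'$ where $b$ is $(4 \times 1)$
and $a'$ is $(1 \times 3)$. Such a system can then be estimated by adapting the methods
described in Section 20.2, and most hypothesis tests on this system should be
asymptotically $\chi^2$.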

The disadvantage of the third approach is that, despite the care one exercises, 
the restrictions imposed may still be invalid—the investigator may have accepted 
a null hypothesis even though it is false, or rejected a null hypothesis that is actually 
true. Moreover, alternative tests for unit roots and cointegration can produce 
conflicting results, and the investigator may be unsure as to which should be fol- 
lowed. 

Experts differ in the advice offered for applied work. One practical solution 
is to employ parts of all three approaches. This eclectic strategy would begin by 
estimating the VAR in levels without restrictions. The next step is to make a quick 
assessment as to which series are likely nonstationary. This assessment could be 
based on graphs of the data, prior information about the series and their likely 
cointegrating relations, or any of the more formal tests discussed in Chapter 17. 
Any nonstationary series can then be differenced or expressed in error-correction 
form and a stationary VAR could then be estimated. For example, to estimate a 
VAR that includes the log of income (y,) and the log of consumption (c,), these 
two variables might be included in a stationary VAR as Ay, and (c, — y,). If the 
VAR for the data in levels form yields similar inferences to those for the VAR in
stationary form, then the researcher might be satisfied that the results were not
governed by the assumptions made about unit roots. If the answers differ, then 
some attempt to reconcile the results should be made. Careful efforts along the 
lines of the third strategy described in this section might convince the investigator 
that the stationary formulation was misspecified, or alternatively that the levels 
results can be explained by the appropriate asymptotic theory. A nice example of 
how asymptotic theory could be used to reconcile conflicting findings was provided 
by Stock and Watson (1989). Alternatively, Christiano and Ljungqvist (1988) pro- 
posed simulating data from the estimated levels model, and seeing whether incor- 
rectly fitting such simulated data with the stationary specification would spuriously 
produce the results found when the stationary specification was fitted to the actual 
data. Similarly, data could be simulated from the stationary model to see if it could 
account for the finding of the levels specification. If we find that a single specifi- 
cation can account for both the levels and the stationary results, then our confidence 
in that specification increases. 


APPENDIX 20.A. Proof of Chapter 20 Proposition 


■ Proof of Proposition 20.1.
(a) First we show that $\lambda_i < 1$ for $i = 1, 2, \ldots, n_1$. Any eigenvalue $\lambda$ of [20.1.8]
satisfies

$$|\Sigma_{YY}^{-1}\Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY} - \lambda \mathbf{I}_{n_1}| = 0.$$

Since $\Sigma_{YY}$ is positive definite, this will be true if and only if

$$|\lambda\Sigma_{YY} - \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}| = 0. \qquad [20.A.1]$$

But from the triangular factorization of $\Sigma$ in equation [4.5.26], the matrix

$$\Sigma_{YY} - \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY} \qquad [20.A.2]$$

is positive definite. Hence, the determinant in [20.A.1] could not be zero at $\lambda = 1$. Note
further that

$$\lambda\Sigma_{YY} - \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY} = (\lambda - 1)\Sigma_{YY} + [\Sigma_{YY} - \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}]. \qquad [20.A.3]$$

If $\lambda > 1$, then the right side of expression [20.A.3] would be the sum of two positive definite
matrices and so would be positive definite. The left side of [20.A.3] would then be positive
definite, implying that the determinant in [20.A.1] could not be zero for $\lambda > 1$. Hence,
$\lambda \ge 1$ is not consistent with [20.A.1].

To see that $\lambda_i \ge 0$, notice that if $\lambda$ were less than zero, then $\lambda\Sigma_{YY}$ would be a negative
number times a positive definite matrix so that $\lambda\Sigma_{YY} - \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}$ would also be a negative
number times a positive definite matrix. Hence, the determinant in [20.A.1] could not be
zero for any value of $\lambda < 0$.

Parallel arguments establish that $0 \le \mu_j < 1$ for $j = 1, 2, \ldots, n_2$.
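(b) Let $\mathbf{k}_i$ be an eigenvector associated with a nonzero eigenvalue $\lambda_i$ of [20.1.8]:

$$\Sigma_{YY}^{-1}\Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}\mathbf{k}_i = \lambda_i\mathbf{k}_i. \qquad [20.A.4]$$

Premultiplying both sides of [20.A.4] by $\Sigma_{XY}$ results in

$$[\Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}\Sigma_{XX}^{-1}][\Sigma_{XY}\mathbf{k}_i] = \lambda_i[\Sigma_{XY}\mathbf{k}_i]. \qquad [20.A.5]$$

But $[\Sigma_{XY}\mathbf{k}_i]$ cannot be zero, for if $[\Sigma_{XY}\mathbf{k}_i]$ did equal zero, then the left side of [20.A.4]
would be zero, implying that $\lambda_i = 0$. Thus, [20.A.5] implies that $\lambda_i$ is also an eigenvalue of
the matrix $[\Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}\Sigma_{XX}^{-1}]$ associated with the eigenvector $[\Sigma_{XY}\mathbf{k}_i]$. Recall further that
eigenvalues are unchanged by transposition of a matrix:

$$[\Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}\Sigma_{XX}^{-1}]' = \Sigma_{XX}^{-1}\Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX},$$

which is the matrix [20.1.12]. This proves that if $\lambda_i$ is a nonzero eigenvalue of [20.1.8], then
it is also an eigenvalue of [20.1.12]. Exactly parallel calculations show that if $\mu_j$ is a nonzero
eigenvalue of [20.1.12], then it is also an eigenvalue of [20.1.8].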




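(c) Premultiply [20.1.10] by $\mathbf{k}_j'\Sigma_{YY}$:

$$\mathbf{k}_j'\Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}\mathbf{k}_i = \lambda_i\mathbf{k}_j'\Sigma_{YY}\mathbf{k}_i. \qquad [20.A.6]$$

Similarly, replace $i$ with $j$ in [20.1.10]:

$$\Sigma_{YY}^{-1}\Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}\mathbf{k}_j = \lambda_j\mathbf{k}_j, \qquad [20.A.7]$$

and premultiply by $\mathbf{k}_i'\Sigma_{YY}$:

$$\mathbf{k}_i'\Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}\mathbf{k}_j = \lambda_j\mathbf{k}_i'\Sigma_{YY}\mathbf{k}_j. \qquad [20.A.8]$$

Both sides of [20.A.6] are scalars and so equal their own transposes. Subtracting [20.A.8]
from [20.A.6], we see that

$$0 = (\lambda_i - \lambda_j)\mathbf{k}_i'\Sigma_{YY}\mathbf{k}_j. \qquad [20.A.9]$$

If $i \ne j$, then $\lambda_i \ne \lambda_j$ and [20.A.9] establishes that $\mathbf{k}_i'\Sigma_{YY}\mathbf{k}_j = 0$ for $i \ne j$. For $i = j$, we
normalized $\mathbf{k}_i'\Sigma_{YY}\mathbf{k}_i = 1$ in [20.1.11]. Thus we have established condition [20.1.3] for the
case of distinct eigenvalues.

Virtually identical calculations show that [20.1.13] and [20.1.14] imply [20.1.4].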

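(d) Transpose [20.1.13] and postmultiply by $\Sigma_{XY}\mathbf{k}_j$:

$$\mathbf{a}_i'\Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}\mathbf{k}_j = \lambda_i\mathbf{a}_i'\Sigma_{XY}\mathbf{k}_j. \qquad [20.A.10]$$

Similarly, premultiply [20.A.7] by $\mathbf{a}_i'\Sigma_{XY}$:

$$\mathbf{a}_i'\Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}\mathbf{k}_j = \lambda_j\mathbf{a}_i'\Sigma_{XY}\mathbf{k}_j. \qquad [20.A.11]$$

Subtracting [20.A.11] from [20.A.10] results in

$$0 = (\lambda_i - \lambda_j)\mathbf{a}_i'\Sigma_{XY}\mathbf{k}_j.$$

This shows that $\mathbf{a}_i'\Sigma_{XY}\mathbf{k}_j = 0$ for $\lambda_i \ne \lambda_j$, as required by [20.1.5].

To find the value of $\mathbf{a}_i'\Sigma_{XY}\mathbf{k}_j$ for $i = j$, premultiply [20.1.13] by $\mathbf{a}_i'\Sigma_{XX}$, making use
of [20.1.14]:

$$\mathbf{a}_i'\Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}\mathbf{a}_i = \lambda_i. \qquad [20.A.12]$$

Let us suppose for illustration that $n_1$ is the smaller of $n_1$ and $n_2$; that is, $n = n_1$.⁵ Then the
matrix of eigenvectors $\mathcal{K}$ is $(n \times n)$ and nonsingular. In this case, [20.1.3] implies that

$$\Sigma_{YY} = [\mathcal{K}']^{-1}\mathcal{K}^{-1},$$

or, taking inverses,

$$\Sigma_{YY}^{-1} = \mathcal{K}\mathcal{K}'. \qquad [20.A.13]$$

Substituting [20.A.13] into [20.A.12], we find that

$$\mathbf{a}_i'\Sigma_{XY}\mathcal{K}\mathcal{K}'\Sigma_{YX}\mathbf{a}_i = \lambda_i. \qquad [20.A.14]$$

Now,

$$
\begin{aligned}
\mathbf{a}_i'\Sigma_{XY}\mathcal{K} &= \mathbf{a}_i'\Sigma_{XY}[\mathbf{k}_1 \;\; \mathbf{k}_2 \;\; \cdots \;\; \mathbf{k}_n] \\
&= [\mathbf{a}_i'\Sigma_{XY}\mathbf{k}_1 \;\;\; \mathbf{a}_i'\Sigma_{XY}\mathbf{k}_2 \;\;\; \cdots \;\;\; \mathbf{a}_i'\Sigma_{XY}\mathbf{k}_i \;\;\; \cdots \;\;\; \mathbf{a}_i'\Sigma_{XY}\mathbf{k}_n] \\
&= [0 \;\;\; 0 \;\;\; \cdots \;\;\; \mathbf{a}_i'\Sigma_{XY}\mathbf{k}_i \;\;\; \cdots \;\;\; 0].
\end{aligned} \qquad [20.A.15]
$$

Substituting [20.A.15] into [20.A.14], it follows that

$$(\mathbf{a}_i'\Sigma_{XY}\mathbf{k}_i)^2 = \lambda_i.$$

Thus, the $i$th canonical correlation,

$$r_i = \mathbf{a}_i'\Sigma_{XY}\mathbf{k}_i,$$

is given by the square root of the eigenvalue $\lambda_i$, as claimed:

$$r_i^2 = \lambda_i. \qquad ■$$

⁵In the converse case when $n = n_2$, a parallel argument can be constructed using the fact that

$$\mathbf{k}_i'\Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}\mathbf{k}_i = \lambda_i.$$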
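
The claims of Proposition 20.1 can be checked numerically. The following is only a sketch under stated assumptions: the joint variance matrix is an arbitrary positive definite matrix drawn at random, the dimensions $n_1 = 2$ and $n_2 = 3$ are illustrative, and the variable names are ours rather than notation from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 2, 3                                   # illustrative dimensions of y_t and x_t
A = rng.standard_normal((n1 + n2, n1 + n2))
Sigma = A @ A.T + 0.1 * np.eye(n1 + n2)         # an arbitrary positive definite matrix
S_yy, S_yx = Sigma[:n1, :n1], Sigma[:n1, n1:]
S_xy, S_xx = Sigma[n1:, :n1], Sigma[n1:, n1:]

# The matrices referred to in the text as [20.1.8] and [20.1.12]
M_y = np.linalg.solve(S_yy, S_yx) @ np.linalg.solve(S_xx, S_xy)
M_x = np.linalg.solve(S_xx, S_xy) @ np.linalg.solve(S_yy, S_yx)

lam, K = np.linalg.eig(M_y)
lam, K = lam.real, K.real                       # eigenvalues are real for these matrices
order = np.argsort(lam)[::-1]                   # sort from largest to smallest
lam, K = lam[order], K[:, order]
mu = np.sort(np.linalg.eigvals(M_x).real)[::-1]

# Normalize each eigenvector so that k_i' S_yy k_i = 1, as in [20.1.11]
K = K / np.sqrt(np.einsum("ij,jk,ki->i", K.T, S_yy, K))

print("lambda:", lam)                           # part (a): each lies in [0, 1)
print("mu    :", mu)                            # part (b): nonzero values match lambda
print("K' S_yy K:\n", K.T @ S_yy @ K)           # part (c): identity matrix
```

Because $n_2 > n_1$ in this illustration, the matrix [20.1.12] has one additional eigenvalue that is zero up to rounding error.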




Chapter 20 Exercises 


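20.1. In this problem you are asked to verify the claim in the text that the first canonical
variates $\eta_{1t}$ and $\xi_{1t}$ represent the linear combinations of $\mathbf{y}_t$ and $\mathbf{x}_t$ with maximum possible
correlation. Consider the following maximization problem:

$$\max_{\{\mathbf{k}_1, \mathbf{a}_1\}} E(\mathbf{k}_1'\mathbf{y}_t\mathbf{x}_t'\mathbf{a}_1)$$

subject to

$$E(\mathbf{k}_1'\mathbf{y}_t\mathbf{y}_t'\mathbf{k}_1) = 1$$
$$E(\mathbf{a}_1'\mathbf{x}_t\mathbf{x}_t'\mathbf{a}_1) = 1.$$

Show that the maximum value achieved for this problem is given by the square root of the
largest eigenvalue of the matrix $\Sigma_{XX}^{-1}\Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}$, and that $\mathbf{a}_1$ is the associated eigenvector
normalized as stated. Show that $\mathbf{k}_1$ is the normalized eigenvector of $\Sigma_{YY}^{-1}\Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}$
associated with this same eigenvalue.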


20.2. It was claimed in the text that the maximized log likelihood function under the null
hypothesis of $h$ cointegrating relations was given by [20.3.2]. What is the nature of the
restriction on the VAR in [20.3.1] when $h = 0$? Show that the value of [20.3.2] for this case
is the same as the log likelihood for a VAR($p - 1$) process fitted to the differenced data
$\Delta\mathbf{y}_t$.

20.3. It was claimed in the text that the maximized log likelihood function under the
alternative hypothesis of $n$ cointegrating relations was given by [20.3.3]. This case involves
regressing $\Delta\mathbf{y}_t$ on a constant, $\mathbf{y}_{t-1}$, and $\Delta\mathbf{y}_{t-1}, \Delta\mathbf{y}_{t-2}, \ldots, \Delta\mathbf{y}_{t-p+1}$ without
restrictions. Let $\hat{\mathbf{g}}_t$ denote the residuals from this unrestricted regression, with
$\hat{\Sigma}_{GG} = (1/T)\sum_{t=1}^{T}\hat{\mathbf{g}}_t\hat{\mathbf{g}}_t'$. Equation [11.1.32] would then assert that the maximized log likelihood
function should be given by

$$\mathcal{L}^* = -(Tn/2)\log(2\pi) - (T/2)\log|\hat{\Sigma}_{GG}| - (Tn/2).$$

Show that this number is the same as that given by formula [20.3.3].

20.4. Consider applying Johansen's likelihood ratio test to univariate data ($n = 1$). Show
that the test of the null hypothesis that $y_t$ is nonstationary ($h = 0$) against the alternative
that $y_t$ is stationary ($h = 1$) can be written

$$T[\log(\hat{\sigma}_0^2) - \log(\hat{\sigma}_1^2)],$$

where $\hat{\sigma}_0^2$ is the average squared residual from a regression of $\Delta y_t$ on a constant and
$\Delta y_{t-1}, \Delta y_{t-2}, \ldots, \Delta y_{t-p+1}$, while $\hat{\sigma}_1^2$ is the average squared residual when $y_{t-1}$ is added as an
explanatory variable to this regression.


Chapter 20 References 


Ahn, S. K., and G. C. Reinsel. 1990. "Estimation for Partially Nonstationary Multivariate
Autoregressive Models." Journal of the American Statistical Association 85:813-23.

Christiano, Lawrence J., and Lars Ljungqvist. 1988. "Money Does Granger-Cause Output
in the Bivariate Money-Output Relation." Journal of Monetary Economics 22:217-35.

Johansen, Søren. 1988. "Statistical Analysis of Cointegration Vectors." Journal of Economic
Dynamics and Control 12:231-54.

Johansen, Søren. 1991. "Estimation and Hypothesis Testing of Cointegration Vectors in Gaussian
Vector Autoregressive Models." Econometrica 59:1551-80.

Johansen, Søren. 1992. "Cointegration in Partial Systems and the Efficiency of Single-Equation
Analysis." Journal of Econometrics 52:389-402.

Johansen, Søren, and Katarina Juselius. 1990. "Maximum Likelihood Estimation and Inference on
Cointegration—with Applications to the Demand for Money." Oxford Bulletin of Economics
and Statistics 52:169-210.

Koopmans, Tjalling C., and William C. Hood. 1953. "The Estimation of Simultaneous
Linear Economic Relationships," in William C. Hood and Tjalling C. Koopmans, eds.,
Studies in Econometric Method. New York: Wiley.

Mosconi, Rocco, and Carlo Giannini. 1992. "Non-Causality in Cointegrated Systems:
Representation, Estimation and Testing." Oxford Bulletin of Economics and Statistics 54:399-417.

Park, Joon Y. 1990. "Maximum Likelihood Estimation of Simultaneous Cointegrated Models."
University of Aarhus. Mimeo.

Park, Joon Y. 1992. "Canonical Cointegrating Regressions." Econometrica 60:119-43.

Park, Joon Y., and Masao Ogaki. 1991. "Inference in Cointegrated Models Using VAR
Prewhitening to Estimate Shortrun Dynamics." University of Rochester. Mimeo.

Phillips, Peter C. B. 1990. "Spectral Regression for Cointegrated Time Series," in William
Barnett, James Powell, and George Tauchen, eds., Nonparametric and Semiparametric
Methods in Economics and Statistics. New York: Cambridge University Press.

Phillips, Peter C. B. 1991a. "Optimal Inference in Cointegrated Systems." Econometrica 59:283-306.

Phillips, Peter C. B. 1991b. "Unidentified Components in Reduced Rank Regression Estimation of
ECM's." Yale University. Mimeo.

Phillips, Peter C. B., and Bruce E. Hansen. 1990. "Statistical Inference in Instrumental Variables
Regression with I(1) Processes." Review of Economic Studies 57:99-125.

Phillips, Peter C. B., and S. Ouliaris. 1990. "Asymptotic Properties of Residual Based Tests for
Cointegration." Econometrica 58:165-93.

Stock, James H., and Mark W. Watson. 1988. "Testing for Common Trends." Journal of
the American Statistical Association 83:1097-1107.

Stock, James H., and Mark W. Watson. 1989. "Interpreting the Evidence on Money-Income
Causality." Journal of Econometrics 40:161-81.

Toda, H. Y., and Peter C. B. Phillips. Forthcoming. "Vector Autoregression and Causality."
Econometrica.

West, Kenneth D. 1988. "Asymptotic Normality, When Regressors Have a Unit Root."
Econometrica 56:1397-1417.




21 


Time Series Models 
of Heteroskedasticity 


21.1. Autoregressive Conditional 
Heteroskedasticity (ARCH) 


An autoregressive process of order $p$ (denoted AR($p$)) for an observed variable
$y_t$ takes the form

$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + u_t, \qquad [21.1.1]$$

where $u_t$ is white noise:

$$E(u_t) = 0 \qquad [21.1.2]$$

$$E(u_t u_\tau) = \begin{cases} \sigma^2 & \text{for } t = \tau \\ 0 & \text{otherwise.} \end{cases} \qquad [21.1.3]$$

The process is covariance-stationary provided that the roots of

$$1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_p z^p = 0$$

are outside the unit circle. The optimal linear forecast of the level of $y_t$ for an
AR($p$) process is

$$\hat{E}(y_t \mid y_{t-1}, y_{t-2}, \ldots) = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p}, \qquad [21.1.4]$$

where $\hat{E}(y_t \mid y_{t-1}, y_{t-2}, \ldots)$ denotes the linear projection of $y_t$ on a constant and
$(y_{t-1}, y_{t-2}, \ldots)$. While the conditional mean of $y_t$ changes over time according
to [21.1.4], provided that the process is covariance-stationary, the unconditional
mean of $y_t$ is constant:

$$E(y_t) = c/(1 - \phi_1 - \phi_2 - \cdots - \phi_p).$$


Sometimes we might be interested in forecasting not only the level of the
series $y_t$ but also its variance. For example, Figure 21.1 plots the federal funds
rate, which is an interest rate charged on overnight loans from one bank to another.
This interest rate has been much more volatile at some times than at others. Changes
in the variance are quite important for understanding financial markets, since
investors require higher expected returns as compensation for holding riskier assets.
A variance that changes over time also has implications for the validity and
efficiency of statistical inference about the parameters $(c, \phi_1, \phi_2, \ldots, \phi_p)$ that
describe the dynamics of the level of $y_t$.

FIGURE 21.1 U.S. federal funds rate (monthly averages quoted at an annual
rate), 1955-89.

Although [21.1.3] implies that the unconditional variance of $u_t$ is the constant
$\sigma^2$, the conditional variance of $u_t$ could change over time. One approach is to
describe the square of $u_t$ as itself following an AR($m$) process:

$$u_t^2 = \zeta + \alpha_1 u_{t-1}^2 + \alpha_2 u_{t-2}^2 + \cdots + \alpha_m u_{t-m}^2 + w_t, \qquad [21.1.5]$$

where $w_t$ is a new white noise process:

$$E(w_t) = 0$$

$$E(w_t w_\tau) = \begin{cases} \lambda^2 & \text{for } t = \tau \\ 0 & \text{otherwise.} \end{cases}$$

Since $u_t$ is the error in forecasting $y_t$, expression [21.1.5] implies that the linear
projection of the squared error of a forecast of $y_t$ on the previous $m$ squared forecast
errors is given by

$$\hat{E}(u_t^2 \mid u_{t-1}^2, u_{t-2}^2, \ldots) = \zeta + \alpha_1 u_{t-1}^2 + \alpha_2 u_{t-2}^2 + \cdots + \alpha_m u_{t-m}^2. \qquad [21.1.6]$$


A white noise process $u_t$ satisfying [21.1.5] is described as an autoregressive
conditional heteroskedastic process of order $m$, denoted $u_t \sim \text{ARCH}(m)$. This class of
processes was introduced by Engle (1982).¹

¹A nice survey of ARCH-related models was provided by Bollerslev, Chou, and Kroner (1992).

Since $u_t$ is random and $u_t^2$ cannot be negative, this can be a sensible
representation only if [21.1.6] is positive and [21.1.5] is nonnegative for all realizations
of $\{u_t\}$. This can be ensured if $w_t$ is bounded from below by $-\zeta$ with $\zeta > 0$ and if
$\alpha_j \ge 0$ for $j = 1, 2, \ldots, m$. In order for $u_t^2$ to be covariance-stationary, we further
require that the roots of

$$1 - \alpha_1 z - \alpha_2 z^2 - \cdots - \alpha_m z^m = 0$$

lie outside the unit circle. If the $\alpha_j$ are all nonnegative, this is equivalent to the
requirement that

$$\alpha_1 + \alpha_2 + \cdots + \alpha_m < 1. \qquad [21.1.7]$$

When these conditions are satisfied, the unconditional variance of $u_t$ is given by

$$\sigma^2 = E(u_t^2) = \zeta/(1 - \alpha_1 - \alpha_2 - \cdots - \alpha_m). \qquad [21.1.8]$$

Let $\hat{u}^2_{t+s|t}$ denote an $s$-period-ahead linear forecast:

$$\hat{u}^2_{t+s|t} = \hat{E}(u_{t+s}^2 \mid u_t^2, u_{t-1}^2, \ldots).$$

This can be calculated as in [4.2.27] by iterating on

$$(\hat{u}^2_{t+j|t} - \sigma^2) = \alpha_1(\hat{u}^2_{t+j-1|t} - \sigma^2) + \alpha_2(\hat{u}^2_{t+j-2|t} - \sigma^2) + \cdots + \alpha_m(\hat{u}^2_{t+j-m|t} - \sigma^2)$$

for $j = 1, 2, \ldots, s$, where

$$\hat{u}^2_{\tau|t} = u_\tau^2 \quad \text{for } \tau \le t.$$

The $s$-period-ahead forecast $\hat{u}^2_{t+s|t}$ converges in probability to $\sigma^2$ as $s \to \infty$, assuming
that $w_t$ has finite variance and that [21.1.7] is satisfied.
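
The forecast recursion just described is easy to carry out numerically. The following sketch is only illustrative: the function name `arch_forecast` and its argument layout are our own assumptions, and the parameters are taken as known rather than estimated.

```python
import numpy as np

def arch_forecast(u2_hist, zeta, alpha, s):
    """Linear forecasts of u_{t+1}^2, ..., u_{t+s}^2 for an ARCH(m) process.

    u2_hist : observed squared values, most recent last (at least m of them)
    zeta, alpha : the constant and the coefficients (alpha_1, ..., alpha_m) of [21.1.5]
    """
    alpha = np.asarray(alpha, dtype=float)
    m = alpha.size
    sigma2 = zeta / (1.0 - alpha.sum())          # unconditional variance, [21.1.8]
    # Deviations from sigma2; for dates tau <= t the "forecast" is the observed u_tau^2
    dev = list(np.asarray(u2_hist, dtype=float)[-m:] - sigma2)
    forecasts = []
    for _ in range(s):
        nxt = sum(alpha[i] * dev[-1 - i] for i in range(m))
        dev.append(nxt)
        forecasts.append(nxt + sigma2)
    return np.array(forecasts)

# ARCH(1) with zeta = 1 and alpha_1 = 0.5, so sigma^2 = 2; the last observed u_t^2 is 4
print(arch_forecast([4.0], 1.0, [0.5], 3))       # forecasts decay toward sigma^2 = 2
```

In the ARCH(1) illustration the deviation from $\sigma^2$ is halved each period, which is the geometric convergence to the unconditional variance described in the text.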

It is often convenient to use an alternative representation for an ARCH($m$)
process that imposes slightly stronger assumptions about the serial dependence of
$u_t$. Suppose that

$$u_t = \sqrt{h_t} \cdot v_t, \qquad [21.1.9]$$

where $\{v_t\}$ is an i.i.d. sequence with zero mean and unit variance:

$$E(v_t) = 0 \qquad E(v_t^2) = 1.$$

If $h_t$ evolves according to

$$h_t = \zeta + \alpha_1 u_{t-1}^2 + \alpha_2 u_{t-2}^2 + \cdots + \alpha_m u_{t-m}^2, \qquad [21.1.10]$$

then [21.1.9] implies that

$$E(u_t^2 \mid u_{t-1}, u_{t-2}, \ldots) = \zeta + \alpha_1 u_{t-1}^2 + \alpha_2 u_{t-2}^2 + \cdots + \alpha_m u_{t-m}^2. \qquad [21.1.11]$$

Hence, if $u_t$ is generated by [21.1.9] and [21.1.10], then $u_t$ follows an ARCH($m$)
process in which the linear projection [21.1.6] is also the conditional expectation.

Notice further that when [21.1.9] and [21.1.10] are substituted into [21.1.5],
the result is

$$h_t \cdot v_t^2 = h_t + w_t.$$

Hence, under the specification in [21.1.9], the innovation $w_t$ in the AR($m$)
representation for $u_t^2$ in [21.1.5] can be expressed as

$$w_t = h_t \cdot (v_t^2 - 1). \qquad [21.1.12]$$

Note from [21.1.12] that although the unconditional variance of $w_t$ was assumed
to be constant,

$$E(w_t^2) = \lambda^2, \qquad [21.1.13]$$

the conditional variance of $w_t$ changes over time.
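
A short simulation can make the representation [21.1.9]-[21.1.10] concrete. This is a sketch under assumptions of our own: Gaussian $v_t$, presample values of $u_t^2$ set to the unconditional variance, and a burn-in period that is discarded; the function name `simulate_arch` is hypothetical.

```python
import numpy as np

def simulate_arch(zeta, alpha, T, seed=0, burn=500):
    """Simulate u_t = sqrt(h_t) * v_t with h_t given by [21.1.10] and Gaussian v_t."""
    rng = np.random.default_rng(seed)
    alpha = np.asarray(alpha, dtype=float)
    m = alpha.size
    n = T + burn + m
    sigma2 = zeta / (1.0 - alpha.sum())          # [21.1.8]; requires [21.1.7]
    v = rng.standard_normal(n)
    h = np.full(n, sigma2)
    u = np.zeros(n)
    u[:m] = np.sqrt(sigma2)                      # presample start-up values (an assumption)
    for t in range(m, n):
        # alpha[0] multiplies u_{t-1}^2, alpha[1] multiplies u_{t-2}^2, and so on
        h[t] = zeta + alpha @ (u[t - m:t][::-1] ** 2)
        u[t] = np.sqrt(h[t]) * v[t]
    return u[-T:], h[-T:], v[-T:]

u, h, v = simulate_arch(zeta=1.0, alpha=[0.3], T=20000)
w = u**2 - h                                     # innovation in the AR(m) form [21.1.5]
print(np.mean(u**2), 1.0 / (1.0 - 0.3))          # sample and unconditional variance
```

By construction `w` coincides with `h * (v**2 - 1)`, which is expression [21.1.12].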
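The unconditional variance of $w_t$ reflects the fourth moment of $u_t$, and this
fourth moment does not exist for all stationary ARCH models. One can see this
by squaring [21.1.12] and calculating the unconditional expectation of both sides:

$$E(w_t^2) = E(h_t^2) \cdot E(v_t^2 - 1)^2. \qquad [21.1.14]$$

Taking the ARCH(1) specification for illustration, we find with a little manipulation
of the formulas for the mean and variance of an AR(1) process that

$$
\begin{aligned}
E(h_t^2) &= E(\zeta + \alpha_1 u_{t-1}^2)^2 \\
&= E\{(\alpha_1^2 \cdot u_{t-1}^4) + (2\alpha_1\zeta \cdot u_{t-1}^2) + \zeta^2\} \\
&= \alpha_1^2 \cdot [\mathrm{Var}(u_{t-1}^2) + [E(u_{t-1}^2)]^2] + 2\alpha_1\zeta \cdot E(u_{t-1}^2) + \zeta^2 \\
&= \alpha_1^2 \cdot \left[\frac{\lambda^2}{1 - \alpha_1^2} + \frac{\zeta^2}{(1 - \alpha_1)^2}\right] + \frac{2\alpha_1\zeta^2}{1 - \alpha_1} + \zeta^2 \\
&= \frac{\alpha_1^2\lambda^2}{1 - \alpha_1^2} + \frac{\zeta^2}{(1 - \alpha_1)^2}.
\end{aligned} \qquad [21.1.15]
$$

Substituting [21.1.15] and [21.1.13] into [21.1.14], we conclude that $\lambda^2$ (the
unconditional variance of $w_t$) must satisfy

$$\lambda^2 = \left[\frac{\alpha_1^2\lambda^2}{1 - \alpha_1^2} + \frac{\zeta^2}{(1 - \alpha_1)^2}\right] \times E(v_t^2 - 1)^2. \qquad [21.1.16]$$

Even when $|\alpha_1| < 1$, equation [21.1.16] may not have any real solution for
$\lambda$. For example, if $v_t \sim N(0, 1)$, then $E(v_t^2 - 1)^2 = 2$ and [21.1.16] requires that

$$\frac{(1 - 3\alpha_1^2)\lambda^2}{1 - \alpha_1^2} = \frac{2\zeta^2}{(1 - \alpha_1)^2}.$$

This equation has no real solution for $\lambda$ whenever $\alpha_1^2 \ge \tfrac{1}{3}$. Thus, if $u_t \sim \text{ARCH}(1)$
with the innovations $v_t$ in [21.1.9] coming from a Gaussian distribution, then the
second moment of $w_t$ (or the fourth moment of $u_t$) does not exist unless $\alpha_1^2 < \tfrac{1}{3}$.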
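
Solving the last displayed equation for $\lambda^2$ gives $\lambda^2 = 2\zeta^2(1 - \alpha_1^2)/[(1 - \alpha_1)^2(1 - 3\alpha_1^2)]$ whenever $\alpha_1^2 < \tfrac{1}{3}$. The sketch below simply evaluates this expression; the function name is our own, and it covers only the Gaussian ARCH(1) case with $0 \le \alpha_1 < 1$.

```python
def arch1_lambda2(zeta, alpha1):
    """Variance of w_t for a Gaussian ARCH(1) process, or None if it does not exist."""
    if not (0.0 <= alpha1 < 1.0) or 3.0 * alpha1**2 >= 1.0:
        return None                              # no real solution to [21.1.16]
    return 2.0 * zeta**2 * (1.0 - alpha1**2) / ((1.0 - alpha1)**2 * (1.0 - 3.0 * alpha1**2))

lam2 = arch1_lambda2(1.0, 0.5)
# Check the fixed point [21.1.16] with E(v_t^2 - 1)^2 = 2
rhs = 2.0 * (0.5**2 * lam2 / (1.0 - 0.5**2) + 1.0 / (1.0 - 0.5)**2)
print(lam2, rhs)
print(arch1_lambda2(1.0, 0.6))                   # alpha_1^2 = 0.36 exceeds 1/3
```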


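Maximum Likelihood Estimation with Gaussian $v_t$


Suppose that we are interested in estimating the parameters of a regression
model with ARCH disturbances. Let the regression equation be

$$y_t = \mathbf{x}_t'\boldsymbol{\beta} + u_t. \qquad [21.1.17]$$

Here $\mathbf{x}_t$ denotes a vector of predetermined explanatory variables, which could
include lagged values of $y$. The disturbance term $u_t$ is assumed to satisfy [21.1.9]
and [21.1.10]. It is convenient to condition on the first $m$ observations
($t = -m + 1, -m + 2, \ldots, 0$) and to use observations $t = 1, 2, \ldots, T$ for estimation.
Let $\mathcal{Y}_t$ denote the vector of observations obtained through date $t$:

$$\mathcal{Y}_t \equiv (y_t, y_{t-1}, \ldots, y_1, y_0, \ldots, y_{-m+1}, \mathbf{x}_t', \mathbf{x}_{t-1}', \ldots, \mathbf{x}_1', \mathbf{x}_0', \ldots, \mathbf{x}_{-m+1}')'.$$

If $v_t \sim$ i.i.d. $N(0, 1)$ with $v_t$ independent of both $\mathbf{x}_t$ and $\mathcal{Y}_{t-1}$, then the conditional
distribution of $y_t$ is Gaussian with mean $\mathbf{x}_t'\boldsymbol{\beta}$ and variance $h_t$:

$$f(y_t \mid \mathbf{x}_t, \mathcal{Y}_{t-1}) = \frac{1}{\sqrt{2\pi h_t}} \exp\left(\frac{-(y_t - \mathbf{x}_t'\boldsymbol{\beta})^2}{2h_t}\right), \qquad [21.1.18]$$

where

$$
\begin{aligned}
h_t &= \zeta + \alpha_1(y_{t-1} - \mathbf{x}_{t-1}'\boldsymbol{\beta})^2 + \alpha_2(y_{t-2} - \mathbf{x}_{t-2}'\boldsymbol{\beta})^2 + \cdots + \alpha_m(y_{t-m} - \mathbf{x}_{t-m}'\boldsymbol{\beta})^2 \\
&= [\mathbf{z}_t(\boldsymbol{\beta})]'\boldsymbol{\delta}
\end{aligned} \qquad [21.1.19]
$$

for

$$\boldsymbol{\delta} \equiv (\zeta, \alpha_1, \alpha_2, \ldots, \alpha_m)'$$

$$[\mathbf{z}_t(\boldsymbol{\beta})]' \equiv (1, (y_{t-1} - \mathbf{x}_{t-1}'\boldsymbol{\beta})^2, (y_{t-2} - \mathbf{x}_{t-2}'\boldsymbol{\beta})^2, \ldots, (y_{t-m} - \mathbf{x}_{t-m}'\boldsymbol{\beta})^2).$$

Collect the unknown parameters to be estimated in an $(a \times 1)$ vector $\boldsymbol{\theta}$:

$$\boldsymbol{\theta} \equiv (\boldsymbol{\beta}', \boldsymbol{\delta}')'.$$

The sample log likelihood conditional on the first $m$ observations is then

$$
\begin{aligned}
\mathcal{L}(\boldsymbol{\theta}) &= \sum_{t=1}^{T} \log f(y_t \mid \mathbf{x}_t, \mathcal{Y}_{t-1}; \boldsymbol{\theta}) \\
&= -(T/2)\log(2\pi) - (1/2)\sum_{t=1}^{T}\log(h_t) - (1/2)\sum_{t=1}^{T}(y_t - \mathbf{x}_t'\boldsymbol{\beta})^2/h_t.
\end{aligned} \qquad [21.1.20]
$$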


For a given numerical value for the parameter vector $\boldsymbol{\theta}$, the sequence of conditional
variances can be calculated from [21.1.19] and used to evaluate the log likelihood
function [21.1.20]. This can then be maximized numerically using the methods
described in Section 5.7. The derivative of the log of the conditional likelihood of
the $t$th observation with respect to the parameter vector $\boldsymbol{\theta}$, known as the $t$th score,
is shown in Appendix 21.A to be given by

$$
\begin{aligned}
\underset{(a \times 1)}{\mathbf{s}_t(\boldsymbol{\theta})} &= \frac{\partial \log f(y_t \mid \mathbf{x}_t, \mathcal{Y}_{t-1}; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \\
&= \left\{\frac{u_t^2 - h_t}{2h_t^2}\right\}
\begin{bmatrix} \sum_{j=1}^{m} -2\alpha_j u_{t-j}\mathbf{x}_{t-j} \\ \mathbf{z}_t(\boldsymbol{\beta}) \end{bmatrix}
+ \begin{bmatrix} (\mathbf{x}_t u_t)/h_t \\ \mathbf{0} \end{bmatrix}.
\end{aligned} \qquad [21.1.21]
$$

The likelihood function can be maximized using the method of scoring as in Engle
(1982, p. 997) or using the Berndt, Hall, Hall, and Hausman (1974) algorithm as
in Bollerslev (1986, p. 317). Alternatively, the gradient of the log likelihood function
can be calculated analytically from the sum of the scores,

$$\nabla\mathcal{L}(\boldsymbol{\theta}) = \sum_{t=1}^{T} \mathbf{s}_t(\boldsymbol{\theta}),$$

or numerically by numerical differentiation of the log likelihood [21.1.20]. The
analytically or numerically evaluated gradient could then be used with any of the
numerical optimization procedures described in Section 5.7.
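
As an illustration of how [21.1.19] through [21.1.21] fit together, the following sketch evaluates the Gaussian log likelihood and the scores for a given parameter vector. The function name, the array layout (the first `m` rows of `y` and `X` are the conditioning observations), and the ordering of `theta` as $(\boldsymbol{\beta}', \zeta, \alpha_1, \ldots, \alpha_m)'$ are our assumptions, not conventions from the text.

```python
import numpy as np

def arch_loglik_and_scores(theta, y, X, m):
    """Conditional Gaussian log likelihood [21.1.20] and the T scores [21.1.21].

    y : array of length T + m;  X : array of shape (T + m, k).
    The first m rows are presample observations that are conditioned on.
    """
    theta = np.asarray(theta, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    k = X.shape[1]
    beta, delta = theta[:k], theta[k:]           # delta = (zeta, alpha_1, ..., alpha_m)
    u = y - X @ beta                             # residuals from [21.1.17]
    T = y.size - m
    loglik = 0.0
    scores = np.zeros((T, k + m + 1))
    for i in range(T):
        t = i + m
        lags = u[t - m:t][::-1]                  # u_{t-1}, ..., u_{t-m}
        z = np.concatenate(([1.0], lags**2))     # z_t(beta)
        h = z @ delta                            # conditional variance [21.1.19]
        loglik += -0.5 * np.log(2 * np.pi) - 0.5 * np.log(h) - 0.5 * u[t]**2 / h
        # Derivative of h_t with respect to beta: sum_j -2 alpha_j u_{t-j} x_{t-j}
        dh_dbeta = -2.0 * (delta[1:] * lags) @ X[t - m:t][::-1]
        s = (u[t]**2 - h) / (2.0 * h**2) * np.concatenate((dh_dbeta, z))
        s[:k] += X[t] * u[t] / h
        scores[i] = s
    return loglik, scores
```

Summing the rows of `scores` gives the analytic gradient, which can be compared with a finite-difference gradient of the returned log likelihood as a check on the algebra. The function does not impose $h_t > 0$; a numerical optimizer would need to enforce the nonnegativity conditions separately.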

Imposing the stationarity condition ($\sum_{j=1}^{m}\alpha_j < 1$) and the nonnegativity
condition ($\alpha_j \ge 0$ for all $j$) can be difficult in practice. Typically, either the value of
$m$ is very small or else some ad hoc structure is imposed on the sequence $\{\alpha_j\}_{j=1}^{m}$,
as in Engle (1982, equation (38)).


Maximum Likelihood Estimation with Non-Gaussian $v_t$


The preceding formulation of the likelihood function assumed that $v_t$ has a
Gaussian distribution. However, the unconditional distribution of many financial
time series seems to have fatter tails than allowed by the Gaussian family. Some
of this can be explained by the presence of ARCH; that is, even if $v_t$ in [21.1.9]
has a Gaussian distribution, the unconditional distribution of $u_t$ is non-Gaussian
with heavier tails than a Gaussian distribution (see Milhøj, 1985, or Bollerslev,
1986, p. 313). Even so, there is a fair amount of evidence that the conditional
distribution of $u_t$ is often non-Gaussian as well.

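The same basic approach can be used with non-Gaussian distributions. For
example, Bollerslev (1987) proposed that $v_t$ in [21.1.9] might be drawn from a $t$
distribution with $\nu$ degrees of freedom, where $\nu$ is regarded as a parameter to be
estimated by maximum likelihood. If $u_t$ has a $t$ distribution with $\nu$ degrees of
freedom and scale parameter $M_t$, then its density is given by

$$f(u_t) = \frac{\Gamma[(\nu + 1)/2]}{(\pi\nu)^{1/2}\,\Gamma(\nu/2)}\, M_t^{-1/2} \left[1 + \frac{u_t^2}{M_t\nu}\right]^{-(\nu+1)/2}, \qquad [21.1.22]$$

where $\Gamma(\cdot)$ is the gamma function described in the discussion following equation
[12.1.18]. If $\nu > 2$, then $u_t$ has mean zero and variance²

$$E(u_t^2) = M_t\nu/(\nu - 2).$$

Hence, a $t$ variable with $\nu$ degrees of freedom and variance $h_t$ is obtained by taking
the scale parameter $M_t$ to be

$$M_t = h_t(\nu - 2)/\nu,$$

for which the density [21.1.22] becomes

$$f(u_t) = \frac{\Gamma[(\nu + 1)/2]}{\pi^{1/2}\,\Gamma(\nu/2)}\, (\nu - 2)^{-1/2} h_t^{-1/2} \left[1 + \frac{u_t^2}{h_t(\nu - 2)}\right]^{-(\nu+1)/2}. \qquad [21.1.23]$$

This density can be used in place of the Gaussian specification [21.1.18] along with
the same specification of the conditional mean and conditional variance used in
[21.1.17] and [21.1.19]. The sample log likelihood conditional on the first $m$
observations then becomes

$$
\begin{aligned}
\sum_{t=1}^{T} \log f(y_t \mid \mathbf{x}_t, \mathcal{Y}_{t-1}; \boldsymbol{\theta})
&= T \log\left\{\frac{\Gamma[(\nu + 1)/2]}{\pi^{1/2}\,\Gamma(\nu/2)}\, (\nu - 2)^{-1/2}\right\} - (1/2)\sum_{t=1}^{T}\log(h_t) \\
&\quad - [(\nu + 1)/2] \sum_{t=1}^{T} \log\left[1 + \frac{(y_t - \mathbf{x}_t'\boldsymbol{\beta})^2}{h_t(\nu - 2)}\right],
\end{aligned} \qquad [21.1.24]
$$

where

$$
\begin{aligned}
h_t &= \zeta + \alpha_1(y_{t-1} - \mathbf{x}_{t-1}'\boldsymbol{\beta})^2 + \alpha_2(y_{t-2} - \mathbf{x}_{t-2}'\boldsymbol{\beta})^2 + \cdots + \alpha_m(y_{t-m} - \mathbf{x}_{t-m}'\boldsymbol{\beta})^2 \\
&= [\mathbf{z}_t(\boldsymbol{\beta})]'\boldsymbol{\delta}.
\end{aligned}
$$

The log likelihood [21.1.24] is then maximized numerically with respect to $\nu$, $\boldsymbol{\beta}$,
and $\boldsymbol{\delta}$ subject to the constraint $\nu > 2$.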
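
The $t$-based log likelihood [21.1.24] can be evaluated along the same lines as the Gaussian one. The sketch below makes the same layout assumptions as before (the first `m` rows of `y` and `X` are conditioning observations, and `delta` holds $(\zeta, \alpha_1, \ldots, \alpha_m)$); the function name is hypothetical and the log of the gamma function is taken from the standard library's `math.lgamma`.

```python
import math
import numpy as np

def arch_t_loglik(nu, beta, delta, y, X, m):
    """Conditional log likelihood [21.1.24] for ARCH(m) with t-distributed v_t."""
    nu = float(nu)
    if nu <= 2.0:
        raise ValueError("the variance exists only for nu > 2")
    u = np.asarray(y, dtype=float) - np.asarray(X, dtype=float) @ np.asarray(beta, dtype=float)
    delta = np.asarray(delta, dtype=float)
    T = u.size - m
    # Log of the constant term Gamma[(nu+1)/2] / (pi^(1/2) Gamma(nu/2)) * (nu-2)^(-1/2)
    const = (math.lgamma((nu + 1.0) / 2.0) - math.lgamma(nu / 2.0)
             - 0.5 * math.log(math.pi) - 0.5 * math.log(nu - 2.0))
    total = T * const
    for t in range(m, m + T):
        z = np.concatenate(([1.0], u[t - m:t][::-1] ** 2))
        h = float(z @ delta)                     # conditional variance [21.1.19]
        total += -0.5 * math.log(h) - 0.5 * (nu + 1.0) * math.log1p(u[t]**2 / (h * (nu - 2.0)))
    return total
```

As $\nu$ grows large the $t$ density approaches the Gaussian density, so for very large `nu` this function should return a value close to the Gaussian log likelihood [21.1.20] at the same parameters, which provides a simple check.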

The same approach can be used with other distributions for $v_t$. Other
distributions that have been employed with ARCH-related models include a
Normal-Poisson mixture distribution (Jorion, 1988), power exponential distribution (Baillie
and Bollerslev, 1989), Normal-log normal mixture (Hsieh, 1989), generalized
exponential distribution (Nelson, 1991), and serially dependent mixture of Normals
(Cai, forthcoming) or $t$ variables (Hamilton and Susmel, forthcoming).


²See, for example, DeGroot (1970, p. 42).


Quasi-Maximum Likelihood Estimation 


Even if the assumption that $v_t$ is i.i.d. $N(0, 1)$ is invalid, we saw in [21.1.6]
that the ARCH specification can still offer a reasonable model on which to base a
linear forecast of the squared value of $u_t$. As shown in Weiss (1984, 1986), Bollerslev
and Wooldridge (1992), and Glosten, Jagannathan, and Runkle (1989),
maximization of the Gaussian log likelihood function [21.1.20] can provide consistent
estimates of the parameters $\zeta, \alpha_1, \alpha_2, \ldots, \alpha_m$ of this linear representation even
when the distribution of $u_t$ is non-Gaussian, provided that $v_t$ in [21.1.9] satisfies

$$E(v_t \mid \mathbf{x}_t, \mathcal{Y}_{t-1}) = 0$$

and

$$E(v_t^2 \mid \mathbf{x}_t, \mathcal{Y}_{t-1}) = 1.$$

However, the standard errors have to be adjusted. Let $\hat{\boldsymbol{\theta}}_T$ be the estimate that
maximizes the Gaussian log likelihood [21.1.20], and let $\boldsymbol{\theta}$ be the true value that
characterizes the linear representations [21.1.9], [21.1.17], and [21.1.19]. Then even
when $v_t$ is actually non-Gaussian, under certain regularity conditions

$$\sqrt{T}(\hat{\boldsymbol{\theta}}_T - \boldsymbol{\theta}) \xrightarrow{L} N(\mathbf{0}, \mathbf{D}^{-1}\mathbf{S}\mathbf{D}^{-1}),$$

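where

$$\mathbf{S} = \operatorname{plim}_{T \to \infty}\, T^{-1}\sum_{t=1}^{T} [\mathbf{s}_t(\boldsymbol{\theta})] \cdot [\mathbf{s}_t(\boldsymbol{\theta})]'$$

for $\mathbf{s}_t(\boldsymbol{\theta})$ the score vector as calculated in [21.1.21], and where

$$
\begin{aligned}
\mathbf{D} &= \operatorname{plim}_{T \to \infty}\, T^{-1}\sum_{t=1}^{T} -E\left\{\frac{\partial \mathbf{s}_t(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}'} \,\middle|\, \mathbf{x}_t, \mathcal{Y}_{t-1}\right\} \\
&= \operatorname{plim}_{T \to \infty}\, T^{-1}\sum_{t=1}^{T} \left\{ \frac{1}{2h_t^2}
\begin{bmatrix} \sum_{j=1}^{m} -2\alpha_j u_{t-j}\mathbf{x}_{t-j} \\ \mathbf{z}_t(\boldsymbol{\beta}) \end{bmatrix}
\begin{bmatrix} \sum_{j=1}^{m} -2\alpha_j u_{t-j}\mathbf{x}_{t-j}' & [\mathbf{z}_t(\boldsymbol{\beta})]' \end{bmatrix}
+ \frac{1}{h_t}\begin{bmatrix} \mathbf{x}_t\mathbf{x}_t' & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix} \right\},
\end{aligned} \qquad [21.1.25]
$$

where

$$\mathcal{Y}_t \equiv (y_t, y_{t-1}, \ldots, y_1, y_0, \ldots, y_{-m+1}, \mathbf{x}_t', \mathbf{x}_{t-1}', \ldots, \mathbf{x}_1', \mathbf{x}_0', \ldots, \mathbf{x}_{-m+1}')'.$$
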

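The second equality in [21.1.25] is established in Appendix 21.A. The matrix $\mathbf{S}$
can be consistently estimated by

$$\hat{\mathbf{S}}_T = T^{-1}\sum_{t=1}^{T} [\mathbf{s}_t(\hat{\boldsymbol{\theta}}_T)] \cdot [\mathbf{s}_t(\hat{\boldsymbol{\theta}}_T)]',$$

where $\mathbf{s}_t(\hat{\boldsymbol{\theta}}_T)$ indicates the vector given in [21.1.21] evaluated at $\hat{\boldsymbol{\theta}}_T$. Similarly, the
matrix $\mathbf{D}$ can be consistently estimated by

$$
\hat{\mathbf{D}}_T = T^{-1}\sum_{t=1}^{T} \left\{ \frac{1}{2\hat{h}_t^2}
\begin{bmatrix} \sum_{j=1}^{m} -2\hat{\alpha}_j \hat{u}_{t-j}\mathbf{x}_{t-j} \\ \mathbf{z}_t(\hat{\boldsymbol{\beta}}) \end{bmatrix}
\begin{bmatrix} \sum_{j=1}^{m} -2\hat{\alpha}_j \hat{u}_{t-j}\mathbf{x}_{t-j}' & [\mathbf{z}_t(\hat{\boldsymbol{\beta}})]' \end{bmatrix}
+ \frac{1}{\hat{h}_t}\begin{bmatrix} \mathbf{x}_t\mathbf{x}_t' & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix} \right\}.
$$

Standard errors for $\hat{\boldsymbol{\theta}}_T$ that are robust to misspecification of the family of densities
can thus be obtained from the square root of diagonal elements of

$$T^{-1}\hat{\mathbf{D}}_T^{-1}\hat{\mathbf{S}}_T\hat{\mathbf{D}}_T^{-1}.$$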


Recall that if the model is correctly specified so that the data were really 
generated by a Gaussian model, then S = D, and this simplifies to the usual 
asymptotic variance matrix for maximum likelihood estimation. 


Estimation by Generalized Method of Moments 


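The ARCH regression model of [21.1.17] and [21.1.19] can be characterized
by the assumptions that the residual in the regression equation is uncorrelated with
the explanatory variables,

$$E[(y_t - \mathbf{x}_t'\boldsymbol{\beta})\mathbf{x}_t] = \mathbf{0},$$

and that the implicit error in forecasting the squared residual is uncorrelated with
lagged squared residuals,

$$E[(u_t^2 - h_t)\mathbf{z}_t] = \mathbf{0}.$$

As noted by Bates and White (1988), Mark (1988), Ferson (1989), Simon (1989),
or Rich, Raymond, and Butler (1991), this means that the parameters of an ARCH
model could be estimated by generalized method of moments,³ choosing
$\boldsymbol{\theta} = (\boldsymbol{\beta}', \boldsymbol{\delta}')'$ so as to minimize

$$[\mathbf{g}(\boldsymbol{\theta}; \mathcal{Y}_T)]'\hat{\mathbf{S}}_T^{-1}[\mathbf{g}(\boldsymbol{\theta}; \mathcal{Y}_T)],$$

where

$$
\mathbf{g}(\boldsymbol{\theta}; \mathcal{Y}_T) =
\begin{bmatrix}
T^{-1}\sum_{t=1}^{T} (y_t - \mathbf{x}_t'\boldsymbol{\beta})\mathbf{x}_t \\
T^{-1}\sum_{t=1}^{T} \{(y_t - \mathbf{x}_t'\boldsymbol{\beta})^2 - [\mathbf{z}_t(\boldsymbol{\beta})]'\boldsymbol{\delta}\}\mathbf{z}_t(\boldsymbol{\beta})
\end{bmatrix}.
$$

The matrix $\hat{\mathbf{S}}_T$, standard errors for parameter estimates, and tests of the model can
be constructed using the methods described in Chapter 14. Any other variables
believed to be uncorrelated with $u_t$ or with $(u_t^2 - h_t)$ could be used as additional
instruments.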
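
The sample moment vector $\mathbf{g}(\boldsymbol{\theta}; \mathcal{Y}_T)$ can be computed directly from its definition. The sketch below is illustrative only; it uses the same assumed data layout as for the likelihood (the first `m` rows of `y` and `X` are conditioning observations), and the function name is ours.

```python
import numpy as np

def arch_gmm_moments(theta, y, X, m):
    """Sample moments g(theta; Y_T) for the ARCH regression model.

    theta = (beta', zeta, alpha_1, ..., alpha_m)'; returns a vector of length k + m + 1.
    """
    theta = np.asarray(theta, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    k = X.shape[1]
    beta, delta = theta[:k], theta[k:]
    u = y - X @ beta
    T = y.size - m
    g1 = np.zeros(k)                             # moments for E[u_t x_t] = 0
    g2 = np.zeros(m + 1)                         # moments for E[(u_t^2 - h_t) z_t] = 0
    for t in range(m, m + T):
        z = np.concatenate(([1.0], u[t - m:t][::-1] ** 2))
        h = z @ delta
        g1 += u[t] * X[t]
        g2 += (u[t]**2 - h) * z
    return np.concatenate((g1, g2)) / T
```

With only these instruments the system is just identified: there are as many moment conditions as parameters, so the minimized criterion is zero at any value of $\boldsymbol{\theta}$ that sets every sample moment to zero. For example, an OLS estimate of $\boldsymbol{\beta}$ followed by an OLS regression of the squared residuals on $\mathbf{z}_t$ does so.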


Testing for ARCH 


Fortunately, it is simple to test whether the residuals $u_t$ from a regression
model exhibit time-varying heteroskedasticity without actually having to estimate
the ARCH parameters. Engle (1982, p. 1000) derived the following test based on
the Lagrange multiplier principle. First the regression of [21.1.17] is estimated by
OLS for observations $t = -m + 1, -m + 2, \ldots, T$ and the OLS sample
residuals $\hat{u}_t$ are saved. Next, $\hat{u}_t^2$ is regressed on a constant and $m$ of its own lagged
values:

$$\hat{u}_t^2 = \zeta + \alpha_1\hat{u}_{t-1}^2 + \alpha_2\hat{u}_{t-2}^2 + \cdots + \alpha_m\hat{u}_{t-m}^2 + e_t, \qquad [21.1.26]$$

for $t = 1, 2, \ldots, T$. The sample size $T$ times the uncentered $R_u^2$ from the regression
of [21.1.26] then converges in distribution to a $\chi^2$ variable with $m$ degrees of freedom
under the null hypothesis that $u_t$ is actually i.i.d. $N(0, \sigma^2)$.

³As noted in Section 14.4, maximum likelihood estimation can itself be viewed as estimation by
GMM in which the orthogonality condition is that the expected score is zero.

Recalling that the ARCH($m$) specification can be regarded as an AR($m$)
process for $u_t^2$, another approach developed by Bollerslev (1988) is to use the
Box-Jenkins methods described in Section 4.8 to analyze the autocorrelations of $u_t^2$.
Other tests for ARCH are described in Bollerslev, Chou, and Kroner (1992, p. 8).


21.2. Extensions 


Generalized Autoregressive Conditional 
Heteroskedasticity (GARCH) 


Equations [21.1.9] and [21.1.10] described an ARCH($m$) process ($u_t$)
characterized by

$$u_t = \sqrt{h_t} \cdot v_t,$$

where $v_t$ is i.i.d. with zero mean and unit variance and where $h_t$ evolves according
to

$$h_t = \zeta + \alpha_1 u_{t-1}^2 + \alpha_2 u_{t-2}^2 + \cdots + \alpha_m u_{t-m}^2.$$

More generally, we can imagine a process for which the conditional variance
depends on an infinite number of lags of $u_{t-j}^2$,

$$h_t = \zeta + \pi(L)u_t^2, \qquad [21.2.1]$$

where

$$\pi(L) = \sum_{j=1}^{\infty} \pi_j L^j.$$

A natural idea is to parameterize $\pi(L)$ as the ratio of two finite-order polynomials:

$$\pi(L) = \frac{\alpha(L)}{1 - \delta(L)} = \frac{\alpha_1 L + \alpha_2 L^2 + \cdots + \alpha_m L^m}{1 - \delta_1 L - \delta_2 L^2 - \cdots - \delta_r L^r}, \qquad [21.2.2]$$

where for now we assume that the roots of $1 - \delta(z) = 0$ are outside the unit
circle. If [21.2.1] is multiplied by $1 - \delta(L)$, the result is

$$[1 - \delta(L)]h_t = [1 - \delta(1)]\zeta + \alpha(L)u_t^2$$

or

$$
\begin{aligned}
h_t = \kappa &+ \delta_1 h_{t-1} + \delta_2 h_{t-2} + \cdots + \delta_r h_{t-r} \\
&+ \alpha_1 u_{t-1}^2 + \alpha_2 u_{t-2}^2 + \cdots + \alpha_m u_{t-m}^2
\end{aligned} \qquad [21.2.3]
$$

for $\kappa \equiv [1 - \delta_1 - \delta_2 - \cdots - \delta_r]\zeta$. Expression [21.2.3] is the generalized
autoregressive conditional heteroskedasticity model, denoted $u_t \sim \text{GARCH}(r, m)$,
proposed by Bollerslev (1986).

One's first guess from expressions [21.2.2] and [21.2.3] might be that $\delta(L)$
describes the "autoregressive" terms for the variance while $\alpha(L)$ captures the
"moving average" terms. However, this is not the case. The easiest way to see
why is to add $u_t^2$ to both sides of [21.2.3] and rewrite the resulting expression as

$$
\begin{aligned}
h_t + u_t^2 = \kappa &- \delta_1(u_{t-1}^2 - h_{t-1}) - \delta_2(u_{t-2}^2 - h_{t-2}) - \cdots - \delta_r(u_{t-r}^2 - h_{t-r}) \\
&+ \delta_1 u_{t-1}^2 + \delta_2 u_{t-2}^2 + \cdots + \delta_r u_{t-r}^2 \\
&+ \alpha_1 u_{t-1}^2 + \alpha_2 u_{t-2}^2 + \cdots + \alpha_m u_{t-m}^2 + u_t^2
\end{aligned}
$$

or

$$
\begin{aligned}
u_t^2 = \kappa &+ (\delta_1 + \alpha_1)u_{t-1}^2 + (\delta_2 + \alpha_2)u_{t-2}^2 + \cdots + (\delta_p + \alpha_p)u_{t-p}^2 \\
&+ w_t - \delta_1 w_{t-1} - \delta_2 w_{t-2} - \cdots - \delta_r w_{t-r},
\end{aligned} \qquad [21.2.4]
$$

where $w_t \equiv u_t^2 - h_t$ and $p \equiv \max\{m, r\}$. We have further defined $\delta_j \equiv 0$ for $j > r$
and $\alpha_j \equiv 0$ for $j > m$. Notice that $h_t$ is the forecast of $u_t^2$ based on its own lagged
values and thus $w_t \equiv u_t^2 - h_t$ is the error associated with this forecast. Thus, $w_t$ is
a white noise process that is fundamental for $u_t^2$. Expression [21.2.4] will then be
recognized as an ARMA($p$, $r$) process for $u_t^2$, in which the $j$th autoregressive
coefficient is the sum of $\delta_j$ plus $\alpha_j$ while the $j$th moving average coefficient is the
negative of $\delta_j$. If $u_t$ is described by a GARCH($r$, $m$) process, then $u_t^2$ follows an
ARMA($p$, $r$) process, where $p$ is the larger of $r$ and $m$.

The nonnegativity requirement is satisfied if $\kappa > 0$ and $\alpha_j \ge 0$, $\delta_j \ge 0$ for
$j = 1, 2, \ldots, p$. From our analysis of ARMA processes, it then follows that $u_t^2$
is covariance-stationary provided that $w_t$ has finite variance and that the roots of

$$1 - (\delta_1 + \alpha_1)z - (\delta_2 + \alpha_2)z^2 - \cdots - (\delta_p + \alpha_p)z^p = 0$$

are outside the unit circle. Given the nonnegativity restriction, this means that
$u_t^2$ is covariance-stationary if

$$(\delta_1 + \alpha_1) + (\delta_2 + \alpha_2) + \cdots + (\delta_p + \alpha_p) < 1.$$

Assuming that this condition holds, the unconditional mean of $u_t^2$ is

$$E(u_t^2) = \sigma^2 = \kappa/[1 - (\delta_1 + \alpha_1) - (\delta_2 + \alpha_2) - \cdots - (\delta_p + \alpha_p)].$$
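
The mapping from GARCH($r$, $m$) parameters to the ARMA($p$, $r$) representation of $u_t^2$, together with the stationarity check and the unconditional variance, can be sketched as follows. The function name and return layout are our own, and the stationarity check assumes that the nonnegativity restrictions hold.

```python
import numpy as np

def garch_arma_summary(kappa, delta, alpha):
    """ARMA(p, r) coefficients implied for u_t^2 by a GARCH(r, m) process.

    Returns (ar, ma, stationary, sigma2), where sigma2 is None if the sum of the
    autoregressive coefficients is not below unity.
    """
    delta = np.asarray(delta, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    p = max(delta.size, alpha.size)
    d = np.concatenate((delta, np.zeros(p - delta.size)))   # delta_j = 0 for j > r
    a = np.concatenate((alpha, np.zeros(p - alpha.size)))   # alpha_j = 0 for j > m
    ar = d + a                                   # jth autoregressive coefficient
    ma = -delta                                  # jth moving average coefficient
    stationary = bool(ar.sum() < 1.0)            # valid given nonnegative coefficients
    sigma2 = kappa / (1.0 - ar.sum()) if stationary else None
    return ar, ma, stationary, sigma2

ar, ma, ok, sigma2 = garch_arma_summary(0.1, [0.8], [0.1])   # a GARCH(1, 1) illustration
print(ar, ma, ok, sigma2)
```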


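Nelson and Cao (1992) noted that the conditions $\alpha_j \ge 0$ and $\delta_j \ge 0$ are sufficient
but not necessary to ensure nonnegativity of $h_t$. For example, for a GARCH(1, 2)
process, the $\pi(L)$ operator implied by [21.2.2] is given by

$$
\begin{aligned}
\pi(L) &= (1 - \delta_1 L)^{-1}(\alpha_1 L + \alpha_2 L^2) \\
&= (1 + \delta_1 L + \delta_1^2 L^2 + \delta_1^3 L^3 + \cdots)(\alpha_1 L + \alpha_2 L^2) \\
&= \alpha_1 L + (\delta_1\alpha_1 + \alpha_2)L^2 + \delta_1(\delta_1\alpha_1 + \alpha_2)L^3 + \delta_1^2(\delta_1\alpha_1 + \alpha_2)L^4 + \cdots.
\end{aligned}
$$

The $\pi_j$ coefficients are all nonnegative provided that $0 \le \delta_1 < 1$, $\alpha_1 \ge 0$, and
$(\delta_1\alpha_1 + \alpha_2) \ge 0$. Hence, $\alpha_2$ could be negative as long as $-\alpha_2$ is less than $\delta_1\alpha_1$.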
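
The $\pi_j$ coefficients can be generated for any GARCH($r$, $m$) specification by equating coefficients in $[1 - \delta(L)]\pi(L) = \alpha(L)$, which gives $\pi_j = \alpha_j + \sum_{i} \delta_i \pi_{j-i}$ with $\pi_j = 0$ for $j \le 0$. The following sketch (with a function name of our own choosing) applies this recursion and reproduces the GARCH(1, 2) pattern above for one illustrative set of values in which $\alpha_2$ is negative.

```python
import numpy as np

def garch_pi_weights(delta, alpha, n):
    """First n coefficients pi_1, ..., pi_n of pi(L) = alpha(L) / [1 - delta(L)]."""
    r, m = len(delta), len(alpha)
    pi = np.zeros(n + 1)                         # pi[0] is unused and stays at zero
    for j in range(1, n + 1):
        a_j = alpha[j - 1] if j <= m else 0.0
        pi[j] = a_j + sum(delta[i - 1] * pi[j - i] for i in range(1, min(j - 1, r) + 1))
    return pi[1:]

# GARCH(1, 2) with delta_1 = 0.5, alpha_1 = 0.4, alpha_2 = -0.1 (illustrative values)
weights = garch_pi_weights([0.5], [0.4, -0.1], 4)
print(weights)                                   # all nonnegative although alpha_2 < 0
```

Here $\delta_1\alpha_1 + \alpha_2 = 0.1 \ge 0$, so every weight is nonnegative even though $\alpha_2 < 0$.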

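The forecast of $u_{t+s}^2$ based on $u_t^2, u_{t-1}^2, \ldots$, denoted $\hat{u}_{t+s|t}^2$, can be calculated as in [4.2.45] by iterating on

$$(\hat{u}_{t+s|t}^2 - \sigma^2) = \begin{cases} (\delta_1 + \alpha_1)(\hat{u}_{t+s-1|t}^2 - \sigma^2) + (\delta_2 + \alpha_2)(\hat{u}_{t+s-2|t}^2 - \sigma^2) + \cdots + (\delta_p + \alpha_p)(\hat{u}_{t+s-p|t}^2 - \sigma^2) - \delta_s\hat{w}_t - \delta_{s+1}\hat{w}_{t-1} - \cdots - \delta_r\hat{w}_{t+s-r} & \text{for } s = 1, 2, \ldots, r \\ (\delta_1 + \alpha_1)(\hat{u}_{t+s-1|t}^2 - \sigma^2) + (\delta_2 + \alpha_2)(\hat{u}_{t+s-2|t}^2 - \sigma^2) + \cdots + (\delta_p + \alpha_p)(\hat{u}_{t+s-p|t}^2 - \sigma^2) & \text{for } s = r+1, r+2, \ldots, \end{cases}$$

$$\hat{u}_{\tau|t}^2 = u_\tau^2 \quad \text{for } \tau \le t$$

$$\hat{w}_\tau = u_\tau^2 - \hat{u}_{\tau|\tau-1}^2 \quad \text{for } \tau = t, t-1, \ldots, t-r+1.$$

See Baillie and Bollerslev (1992) for further discussion of forecasts and mean squared errors for GARCH processes.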
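For the GARCH(1, 1) case the forecast recursion is simple enough to trace by hand. The following sketch specializes the recursion to $r = m = 1$, assuming $\delta_1 + \alpha_1 < 1$; the function name `garch11_forecasts` and all numerical values are hypothetical.

```python
def garch11_forecasts(kappa, delta1, alpha1, u2_t, h_t, horizon):
    """Forecasts of u_{t+s}^2 for s = 1..horizon from a GARCH(1, 1) model."""
    phi = delta1 + alpha1           # AR coefficient of the ARMA(1, 1) form [21.2.4]
    sigma2 = kappa / (1.0 - phi)    # unconditional mean of u_t^2 (requires phi < 1)
    w_t = u2_t - h_t                # forecast error w_t = u_t^2 - h_t
    # s = 1 <= r: the moving average term -delta_1 * w_t enters.
    forecasts = [sigma2 + phi * (u2_t - sigma2) - delta1 * w_t]
    # s > r: only the autoregressive part remains.
    for _ in range(horizon - 1):
        forecasts.append(sigma2 + phi * (forecasts[-1] - sigma2))
    return forecasts


f = garch11_forecasts(kappa=0.1, delta1=0.8, alpha1=0.1, u2_t=2.5, h_t=1.5, horizon=5)
h_next = 0.1 + 0.8 * 1.5 + 0.1 * 2.5   # h_{t+1} computed directly from the GARCH recursion
print(f[0], h_next)
```

The one-step forecast coincides with $h_{t+1}$ from the GARCH recursion, as it should, and later forecasts decay geometrically toward $\sigma^2$.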

Calculation of the sequence of conditional variances $\{h_t\}_{t=1}^{T}$ from [21.2.3] requires presample values for $h_{-p+1}, \ldots, h_0$ and $u_{-p+1}^2, \ldots, u_0^2$. If we have observations on $y_t$ and $x_t$ for $t = 1, 2, \ldots, T$, Bollerslev (1986, p. 316) suggested setting

$$h_j = u_j^2 = \hat{\sigma}^2 \quad \text{for } j = -p+1, \ldots, 0,$$

where

$$\hat{\sigma}^2 = T^{-1}\sum_{t=1}^{T}(y_t - x_t'\beta)^2.$$


The sequence $\{h_t\}_{t=1}^{T}$ can be used to evaluate the log likelihood from the expression given in [21.1.20]. This can then be maximized numerically with respect to $\beta$ and the parameters $\kappa, \delta_1, \ldots, \delta_r, \alpha_1, \ldots, \alpha_m$ of the GARCH process; for details, see Bollerslev (1986).
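The recursion and its start-up values can be sketched as follows. The sketch takes the residuals $u_t = y_t - x_t'\beta$ as given and assumes that [21.1.20] is the Gaussian conditional log likelihood; the function name `garch_loglik` and the sample numbers are hypothetical.

```python
import math


def garch_loglik(resid, kappa, delta, alpha):
    """Gaussian log likelihood of residuals u_1..u_T under GARCH(r, m).

    Returns the log likelihood and the list of conditional variances h_1..h_T.
    """
    T = len(resid)
    p = max(len(delta), len(alpha))
    sigma2_hat = sum(u * u for u in resid) / T
    # Presample values: h_j = u_j^2 = sigma2_hat for j = -p+1, ..., 0.
    h = [sigma2_hat] * p
    u2 = [sigma2_hat] * p
    loglik = 0.0
    for u in resid:
        h_t = kappa
        h_t += sum(d * h[-j] for j, d in enumerate(delta, start=1))   # delta_j * h_{t-j}
        h_t += sum(a * u2[-j] for j, a in enumerate(alpha, start=1))  # alpha_j * u_{t-j}^2
        loglik += -0.5 * (math.log(2 * math.pi) + math.log(h_t) + u * u / h_t)
        h.append(h_t)
        u2.append(u * u)
    return loglik, h[p:]


ll, h_seq = garch_loglik([1.0, -2.0], kappa=0.1, delta=[0.8], alpha=[0.1])
print(ll, h_seq)
```

In practice this function would be handed to a numerical optimizer that searches over $\beta$ and the GARCH parameters, recomputing the residuals at each trial value of $\beta$.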


Integrated GARCH 


Suppose that $u_t = \sqrt{h_t}\cdot v_t$, where $v_t$ is i.i.d. with zero mean and unit variance and where $h_t$ obeys the GARCH(r, m) specification

$$h_t = \kappa + \delta_1 h_{t-1} + \delta_2 h_{t-2} + \cdots + \delta_r h_{t-r} + \alpha_1 u_{t-1}^2 + \alpha_2 u_{t-2}^2 + \cdots + \alpha_m u_{t-m}^2.$$

We saw in [21.2.4] that this implies an ARMA process for $u_t^2$ where the jth autoregressive coefficient is given by $(\delta_j + \alpha_j)$. This ARMA process for $u_t^2$ would have a unit root if

$$\sum_{j=1}^{r}\delta_j + \sum_{j=1}^{m}\alpha_j = 1.$$  [21.2.5]

Engle and Bollerslev (1986) referred to a model satisfying [21.2.5] as an integrated GARCH process, denoted IGARCH.

If $u_t$ follows an IGARCH process, then the unconditional variance of $u_t$ is infinite, so neither $u_t$ nor $u_t^2$ satisfies the definition of a covariance-stationary process. However, it is still possible for $u_t$ to come from a strictly stationary process in the sense that the unconditional density of $u_t$ is the same for all t; see Nelson (1990).


The ARCH-in-Mean Specification 


Finance theory suggests that an asset with a higher perceived risk would pay a higher return on average. For example, let $r_t$ denote the ex post rate of return on some asset minus the return on a safe alternative asset. Suppose that $r_t$ is decomposed into a component anticipated by investors at date $t - 1$ (denoted $\mu_t$) and a component that was unanticipated (denoted $u_t$):

$$r_t = \mu_t + u_t.$$

Then the theory suggests that the mean return ($\mu_t$) would be related to the variance of the return ($h_t$). In general, the ARCH-in-mean, or ARCH-M, regression model introduced by Engle, Lilien, and Robins (1987) is characterized by

$$y_t = x_t'\beta + \delta h_t + u_t$$
$$u_t = \sqrt{h_t}\cdot v_t$$
$$h_t = \zeta + \alpha_1 u_{t-1}^2 + \alpha_2 u_{t-2}^2 + \cdots + \alpha_m u_{t-m}^2$$

for $v_t$ i.i.d. with zero mean and unit variance. The effect that higher perceived variability of $u_t$ has on the level of $y_t$ is captured by the parameter $\delta$.




Exponential GARCH 


As before, let $u_t = \sqrt{h_t}\cdot v_t$, where $v_t$ is i.i.d. with zero mean and unit variance. Nelson (1991) proposed the following model for the evolution of the conditional variance of $u_t$:

$$\log h_t = \zeta + \sum_{j=1}^{\infty}\pi_j\cdot\{|v_{t-j}| - E|v_{t-j}| + \aleph v_{t-j}\}.$$  [21.2.6]

Nelson's model is sometimes referred to as exponential GARCH, or EGARCH. If $\pi_j > 0$, Nelson's model implies that a deviation of $|v_{t-j}|$ from its expected value causes the variance of $u_t$ to be larger than otherwise, an effect similar to the idea behind the GARCH specification.

The $\aleph$ parameter allows this effect to be asymmetric. If $\aleph = 0$, then a positive surprise ($v_{t-j} > 0$) has the same effect on volatility as a negative surprise of the same magnitude. If $-1 < \aleph < 0$, a positive surprise increases volatility less than a negative surprise. If $\aleph < -1$, a positive surprise actually reduces volatility while a negative surprise increases volatility. A number of researchers have found evidence of asymmetry in stock price behavior: negative surprises seem to increase volatility more than positive surprises.⁴ Since a lower stock price reduces the value of equity relative to corporate debt, a sharp decline in stock prices increases corporate leverage and could thus increase the risk of holding stocks. For this reason, the apparent finding that $\aleph < 0$ is sometimes described as the leverage effect.

One of the key advantages of Nelson's specification is that since [21.2.6] describes the log of $h_t$, the variance itself ($h_t$) will be positive regardless of whether the $\pi_j$ coefficients are positive. Thus, in contrast to the GARCH model, no restrictions need to be imposed on [21.2.6] for estimation. This makes numerical optimization simpler and allows a more flexible class of possible dynamic models for the variance. Nelson (1991, p. 351) showed that [21.2.6] implies that $\log h_t$, $h_t$, and $u_t$ are all strictly stationary provided that $\sum_{j=1}^{\infty}\pi_j^2 < \infty$.

A natural parameterization is to model $\pi(L)$ as the ratio of two finite-order polynomials as in the GARCH(r, m) specification:

$$\log h_t = \kappa + \delta_1\log h_{t-1} + \delta_2\log h_{t-2} + \cdots + \delta_r\log h_{t-r} + \alpha_1\{|v_{t-1}| - E|v_{t-1}| + \aleph v_{t-1}\} + \alpha_2\{|v_{t-2}| - E|v_{t-2}| + \aleph v_{t-2}\} + \cdots + \alpha_m\{|v_{t-m}| - E|v_{t-m}| + \aleph v_{t-m}\}.$$  [21.2.7]
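The asymmetry built into [21.2.7] is easy to see in a one-step calculation. The sketch below takes $r = m = 1$ and assumes, purely for illustration, that $v_t$ is standard Normal so that $E|v_t| = \sqrt{2/\pi}$; the function name `egarch11_log_h` and the parameter values are hypothetical.

```python
import math


def egarch11_log_h(kappa, delta1, alpha1, aleph, log_h_prev, v_prev):
    """One step of [21.2.7] with r = m = 1, assuming Gaussian v_t."""
    e_abs_v = math.sqrt(2.0 / math.pi)   # E|v_t| for a standard Normal variable
    news = abs(v_prev) - e_abs_v + aleph * v_prev
    return kappa + delta1 * log_h_prev + alpha1 * news


# Same-sized positive and negative surprises with aleph = -0.5 (illustrative value).
up = egarch11_log_h(0.0, 0.9, 0.2, -0.5, log_h_prev=0.0, v_prev=1.0)
down = egarch11_log_h(0.0, 0.9, 0.2, -0.5, log_h_prev=0.0, v_prev=-1.0)
print(math.exp(up), math.exp(down))
```

Because the recursion is written for $\log h_t$, the implied variance $\exp(\log h_t)$ is positive for any parameter values, and with $\aleph < 0$ the negative surprise leaves the larger variance.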
⁴See Pagan and Schwert (1990), Engle and Ng (1991), and the studies cited in Bollerslev, Chou, and Kroner (1992, p. 24).

The EGARCH model can be estimated by maximum likelihood by specifying a density for $v_t$. Nelson proposed using the generalized error distribution, normalized to have zero mean and unit variance:

$$f(v_t) = \frac{\nu\exp[-(1/2)|v_t/\lambda|^{\nu}]}{\lambda\cdot 2^{[(\nu+1)/\nu]}\,\Gamma(1/\nu)}.$$  [21.2.8]

Here $\Gamma(\cdot)$ is the gamma function, $\lambda$ is a constant given by

$$\lambda = \left\{\frac{2^{(-2/\nu)}\,\Gamma(1/\nu)}{\Gamma(3/\nu)}\right\}^{1/2},$$

and $\nu$ is a positive parameter governing the thickness of the tails. For $\nu = 2$, the constant $\lambda = 1$ and expression [21.2.8] is just the standard Normal density. If $\nu < 2$, the density has thicker tails than the Normal, whereas for $\nu > 2$ it has thinner tails. The expected absolute value of a variable drawn from this distribution is

$$E|v_t| = \frac{\lambda\cdot 2^{1/\nu}\,\Gamma(2/\nu)}{\Gamma(1/\nu)}.$$

For the standard Normal case ($\nu = 2$), this becomes

$$E|v_t| = \sqrt{2/\pi}.$$
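The claims about the generalized error distribution can be verified numerically with the standard library's gamma function. The helper names `ged_lambda`, `ged_density`, and `ged_expected_abs` below are hypothetical labels for the three formulas above.

```python
import math


def ged_lambda(nu):
    """The constant lambda that normalizes the density to unit variance."""
    return math.sqrt(2.0 ** (-2.0 / nu) * math.gamma(1.0 / nu) / math.gamma(3.0 / nu))


def ged_density(v, nu):
    """Generalized error density [21.2.8] evaluated at v."""
    lam = ged_lambda(nu)
    numerator = nu * math.exp(-0.5 * abs(v / lam) ** nu)
    denominator = lam * 2.0 ** ((nu + 1.0) / nu) * math.gamma(1.0 / nu)
    return numerator / denominator


def ged_expected_abs(nu):
    """E|v_t| for the generalized error distribution."""
    return ged_lambda(nu) * 2.0 ** (1.0 / nu) * math.gamma(2.0 / nu) / math.gamma(1.0 / nu)


normal_pdf = math.exp(-0.5 * 0.5 ** 2) / math.sqrt(2.0 * math.pi)
print(ged_lambda(2.0), ged_density(0.5, 2.0), normal_pdf)
print(ged_expected_abs(2.0), math.sqrt(2.0 / math.pi))
print(ged_density(4.0, 1.0) > ged_density(4.0, 2.0))  # thicker tails when nu < 2
```

Setting $\nu = 2$ reproduces the standard Normal density and $E|v_t| = \sqrt{2/\pi}$, while $\nu = 1$ puts noticeably more mass far from zero.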


As an illustration of how this model might be used, consider Nelson's analysis of stock return data. For $r_t$ the daily return on stocks minus the daily interest rate on Treasury bills, Nelson estimated a regression model of the form

$$r_t = a + b r_{t-1} + \delta h_t + u_t.$$


The residual $u_t$ was modeled as $\sqrt{h_t}\cdot v_t$, where $v_t$ is i.i.d. with density [21.2.8] and where $h_t$ evolves according to

$$\log h_t - \zeta_t = \delta_1(\log h_{t-1} - \zeta_{t-1}) + \delta_2(\log h_{t-2} - \zeta_{t-2}) + \alpha_1\{|v_{t-1}| - E|v_{t-1}| + \aleph v_{t-1}\} + \alpha_2\{|v_{t-2}| - E|v_{t-2}| + \aleph v_{t-2}\}.$$  [21.2.9]

Nelson allowed $\zeta_t$, the unconditional mean of $\log h_t$, to be a function of time:

$$\zeta_t = \zeta + \log(1 + \rho N_t),$$

where $N_t$ denotes the number of nontrading days between dates $t - 1$ and $t$ and $\zeta$ and $\rho$ are parameters to be estimated by maximum likelihood. The sample log likelihood is then

$$\mathcal{L} = T\{\log(\nu/\lambda) - (1 + \nu^{-1})\log(2) - \log[\Gamma(1/\nu)]\} - (1/2)\sum_{t=1}^{T}\left|\frac{r_t - a - b r_{t-1} - \delta h_t}{\lambda\sqrt{h_t}}\right|^{\nu} - (1/2)\sum_{t=1}^{T}\log(h_t).$$

The sequence $\{h_t\}_{t=1}^{T}$ is obtained by iterating on [21.2.7] with

$$v_t = (r_t - a - b r_{t-1} - \delta h_t)/\sqrt{h_t}$$

and with presample values of $\log h_t$ set to their unconditional expectations $\zeta_t$.


Other Nonlinear ARCH Specifications 


Asymmetric consequences of positive and negative innovations can also be captured with a simple modification of the linear GARCH framework. Glosten, Jagannathan, and Runkle (1989) proposed modeling $u_t = \sqrt{h_t}\cdot v_t$, where $v_t$ is i.i.d. with zero mean and unit variance and

$$h_t = \kappa + \delta_1 h_{t-1} + \alpha_1 u_{t-1}^2 + \aleph u_{t-1}^2\cdot I_{t-1}.$$  [21.2.10]

Here, $I_{t-1} = 1$ if $u_{t-1} \ge 0$ and $I_{t-1} = 0$ if $u_{t-1} < 0$. Again, if the leverage effect holds, we expect to find $\aleph < 0$. The nonnegativity condition is satisfied provided that $\delta_1 \ge 0$ and $\alpha_1 + \aleph \ge 0$.
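A one-step evaluation of [21.2.10] shows how the indicator term works. The function name `gjr_variance` and the parameter values in this sketch are hypothetical; they were chosen so that $\alpha_1 + \aleph \ge 0$.

```python
def gjr_variance(kappa, delta1, alpha1, aleph, h_prev, u_prev):
    """One step of [21.2.10]; the indicator I_{t-1} equals 1 when u_{t-1} >= 0."""
    indicator = 1.0 if u_prev >= 0 else 0.0
    return kappa + delta1 * h_prev + alpha1 * u_prev ** 2 + aleph * u_prev ** 2 * indicator


# Equal-sized positive and negative innovations with aleph = -0.1.
h_after_gain = gjr_variance(0.1, 0.8, 0.15, -0.1, h_prev=1.0, u_prev=2.0)
h_after_loss = gjr_variance(0.1, 0.8, 0.15, -0.1, h_prev=1.0, u_prev=-2.0)
print(h_after_gain, h_after_loss)
```

With $\aleph < 0$ the positive innovation raises next period's variance by less than the negative innovation of the same size, which is the leverage pattern described above.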

A variety of other nonlinear functional forms relating $h_t$ to $\{u_{t-1}, u_{t-2}, \ldots\}$ have been proposed. Geweke (1986), Pantula (1986), and Milhøj (1987) suggested a specification in which the log of $h_t$ depends linearly on past logs of the squared residuals. Higgins and Bera (1992) proposed a power transformation of the form

$$h_t = [\zeta^{\delta} + \alpha_1(u_{t-1}^2)^{\delta} + \alpha_2(u_{t-2}^2)^{\delta} + \cdots + \alpha_m(u_{t-m}^2)^{\delta}]^{1/\delta},$$

with $\zeta > 0$, $\delta > 0$, and $\alpha_i \ge 0$ for $i = 1, 2, \ldots, m$. Gourieroux and Monfort (1992) used a Markov chain to model the conditional variance as a general stepwise function of past realizations.


Multivariate GARCH Models 


The preceding ideas can also be extended to an $(n \times 1)$ vector $y_t$. Consider a system of n regression equations of the form

$$\underset{(n\times 1)}{y_t} = \underset{(n\times k)}{\Pi'}\;\underset{(k\times 1)}{x_t} + \underset{(n\times 1)}{u_t},$$

where $x_t$ is a vector of explanatory variables and $u_t$ is a vector of white noise residuals. Let $H_t$ denote the $(n \times n)$ conditional variance-covariance matrix of the residuals:

$$H_t = E(u_t u_t'|y_{t-1}, y_{t-2}, \ldots, x_t, x_{t-1}, \ldots).$$

Engle and Kroner (1993) proposed the following vector generalization of a GARCH(r, m) specification:

$$H_t = K + \Delta_1 H_{t-1}\Delta_1' + \Delta_2 H_{t-2}\Delta_2' + \cdots + \Delta_r H_{t-r}\Delta_r' + A_1 u_{t-1}u_{t-1}'A_1' + A_2 u_{t-2}u_{t-2}'A_2' + \cdots + A_m u_{t-m}u_{t-m}'A_m'.$$

Here $K$, $\Delta_s$, and $A_s$ for $s = 1, 2, \ldots$ denote $(n \times n)$ matrices of parameters. An advantage of this parameterization is that $H_t$ is guaranteed to be positive definite as long as $K$ is positive definite, which can be ensured numerically by parameterizing $K$ as $PP'$, where $P$ is a lower triangular matrix.

In practice, for reasonably sized n it is necessary to restrict the specification for $H_t$ further to obtain a numerically tractable formulation. One useful special case restricts $\Delta_s$ and $A_s$ to be diagonal matrices for $s = 1, 2, \ldots$. In such a model, the conditional covariance between $u_{it}$ and $u_{jt}$ depends only on past values of $u_{i,t-s}\cdot u_{j,t-s}$, and not on the products or squares of other residuals.

Another popular approach introduced by Bollerslev (1990) assumes that the conditional correlations among the elements of $u_t$ are constant over time. Let $h_{ii}^{(t)}$ denote the row i, column i element of $H_t$. Thus, $h_{ii}^{(t)}$ represents the conditional variance of the ith element of $u_t$:

$$h_{ii}^{(t)} = E(u_{it}^2|y_{t-1}, y_{t-2}, \ldots, x_t, x_{t-1}, \ldots).$$

This conditional variance might be modeled with a univariate GARCH(1, 1) process driven by the lagged innovation in variable i:

$$h_{ii}^{(t)} = \kappa_i + \delta_i h_{ii}^{(t-1)} + \alpha_i u_{i,t-1}^2.$$

We might postulate n such GARCH specifications ($i = 1, 2, \ldots, n$), one for each element of $u_t$. The conditional covariance between $u_{it}$ and $u_{jt}$, or the row i, column j element of $H_t$, is then taken to be a constant correlation $\rho_{ij}$ times the conditional standard deviations of $u_{it}$ and $u_{jt}$:

$$h_{ij}^{(t)} = E(u_{it}u_{jt}|y_{t-1}, y_{t-2}, \ldots, x_t, x_{t-1}, \ldots) = \rho_{ij}\cdot\sqrt{h_{ii}^{(t)}}\,\sqrt{h_{jj}^{(t)}}.$$

Maximum likelihood estimation of this specification turns out to be quite tractable; see Bollerslev (1990) for details.
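Under the constant-correlation assumption, assembling $H_t$ from the n univariate conditional variances is a purely mechanical step, which the following sketch illustrates. The function name `ccc_covariance` and the numbers are hypothetical; the univariate variances are taken as already computed.

```python
import math


def ccc_covariance(variances, corr):
    """Build H_t from conditional variances h_ii and constant correlations rho_ij."""
    n = len(variances)
    sd = [math.sqrt(v) for v in variances]   # conditional standard deviations
    return [[corr[i][j] * sd[i] * sd[j] for j in range(n)] for i in range(n)]


# Two residuals with conditional variances 4 and 9 and correlation 0.5.
H = ccc_covariance([4.0, 9.0], [[1.0, 0.5], [0.5, 1.0]])
print(H)
```

Only the variances change from date to date, so the off-diagonal elements move with the standard deviations while the implied correlation stays fixed.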




Other multivariate models include a formulation for $\mathrm{vech}(H_t)$ proposed by Bollerslev, Engle, and Wooldridge (1988) and the factor ARCH specifications of Diebold and Nerlove (1989) and Engle, Ng, and Rothschild (1990).


Nonparametric Estimates 


Pagan and Hong (1990) explored a nonparametric kernel estimate of the expected value of $u_t^2$. The estimate is based on an average value of those $u_\tau^2$ whose preceding values of $u_{\tau-1}, u_{\tau-2}, \ldots, u_{\tau-m}$ were "close" to the values that preceded $u_t^2$:

$$h_t = \sum_{\substack{\tau=1 \\ \tau\ne t}}^{T} w_\tau(t)\cdot u_\tau^2.$$

The weights $\{w_\tau(t)\}_{\tau=1,\tau\ne t}^{T}$ are a set of $(T - 1)$ numbers that sum to unity. If the values of $u_{\tau-1}, u_{\tau-2}, \ldots, u_{\tau-m}$ that preceded $u_\tau$ were similar to the values $u_{t-1}, u_{t-2}, \ldots, u_{t-m}$ that preceded $u_t$, then $u_\tau^2$ is viewed as giving useful information about $h_t = E(u_t^2|u_{t-1}, u_{t-2}, \ldots, u_{t-m})$. In this case, the weight $w_\tau(t)$ would be large. If the values that preceded $u_\tau$ are quite different from those that preceded $u_t$, then $u_\tau^2$ is viewed as giving little information about $h_t$ and so $w_\tau(t)$ is small. One popular specification for the weight $w_\tau(t)$ is to use a Gaussian kernel:

$$\kappa_\tau(t) = \prod_{j=1}^{m}(2\pi)^{-1/2}\lambda_j^{-1}\exp[-(u_{t-j} - u_{\tau-j})^2/(2\lambda_j^2)].$$

The positive parameter $\lambda_j$ is known as the bandwidth. The bandwidth calibrates the distance between $u_{t-j}$ and $u_{\tau-j}$; the smaller is $\lambda_j$, the closer $u_{\tau-j}$ must be to $u_{t-j}$ before giving the value of $u_\tau^2$ much weight in estimating $h_t$. To ensure that the weights $w_\tau(t)$ sum to unity, we take

$$w_\tau(t) = \frac{\kappa_\tau(t)}{\sum_{\substack{\tau=1 \\ \tau\ne t}}^{T}\kappa_\tau(t)}.$$

The key difficulty with constructing this estimate is in choosing the bandwidth parameter $\lambda_j$. One approach is known as cross-validation. To illustrate this approach, suppose that the same bandwidth is selected for each lag ($\lambda_j = \lambda$ for $j = 1, 2, \ldots, m$). Then the nonparametric estimate of $h_t$ is implicitly a function of the bandwidth parameter imposed, and accordingly could be denoted $h_t(\lambda)$. We might then choose $\lambda$ so as to minimize

$$\sum_{t=1}^{T}[u_t^2 - h_t(\lambda)]^2.$$
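The kernel estimate and the cross-validation criterion can be sketched directly from these formulas. The code below is a minimal illustration under two stated assumptions: a common bandwidth is used for every lag, and the sums run only over dates for which all m lagged residuals are available (the text does not say how the first m dates are handled). The names `kernel_variance` and `cv_criterion`, the data, and the bandwidth grid are hypothetical.

```python
import math


def kernel_variance(u, t, m, lam):
    """Leave-one-out Gaussian kernel estimate of h_t with common bandwidth lam.

    u is a zero-indexed list of residuals; t must satisfy t >= m.
    """
    num = 0.0
    den = 0.0
    for tau in range(m, len(u)):          # dates that have m lagged values
        if tau == t:
            continue                      # the date t itself is excluded
        k = 1.0
        for j in range(1, m + 1):
            z = (u[t - j] - u[tau - j]) / lam
            k *= math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * lam)
        num += k * u[tau] ** 2
        den += k
    return num / den                      # weights k / den sum to unity


def cv_criterion(u, m, lam):
    """Sum of squared gaps between u_t^2 and the leave-one-out estimate."""
    return sum((u[t] ** 2 - kernel_variance(u, t, m, lam)) ** 2
               for t in range(m, len(u)))


u = [0.5, 1.0, -2.0, 0.3, 1.5, -0.7, 0.2, 1.1]
grid = [0.5, 1.0, 2.0, 4.0]
best = min(grid, key=lambda lam: cv_criterion(u, 1, lam))
print(best, kernel_variance(u, 3, 1, best))
```

A very small bandwidth can make every kernel value underflow to zero, in which case the ratio is undefined; a practical implementation would guard against that case.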


Semiparametric Estimates 


Other approaches to describing the conditional variance of $u_t$ include general series expansions for the function $h_t = h(u_{t-1}, u_{t-2}, \ldots)$ as in Pagan and Schwert (1990, p. 278) or for the density $f(v_t)$ itself as in Gallant and Tauchen (1989) and Gallant, Hsieh, and Tauchen (1989). Engle and Gonzalez-Rivera (1991) combined a parametric specification for $h_t$ with a nonparametric estimate of the density of $v_t$ in [21.1.9].




Comparison of Alternative Models 
of Stock Market Volatility 


A number of approaches have been suggested for comparing alternative ARCH specifications. One appealing measure is to see how well different models of heteroskedasticity forecast the value of $u_t^2$. Pagan and Schwert (1990) fitted a number of different models to monthly U.S. stock returns from 1834 to 1925. They found that the semiparametric and nonparametric methods did a good job in sample, though the parametric models yielded superior out-of-sample forecasts. Nelson's EGARCH specification was one of the best in overall performance from this comparison. Pagan and Schwert concluded that some benefits emerge from using parametric and nonparametric methods together.

Another approach is to calculate various specification tests of the fitted model. Tests can be based on the Lagrange multiplier principle as in Engle, Lilien, and Robins (1987) or Higgins and Bera (1992), on moment tests and analysis of outliers as in Nelson (1991), or on the information matrix equality as in Bera and Zuo (1991). Related robust diagnostics were developed by Bollerslev and Wooldridge (1992). Other diagnostics are illustrated in Hsieh (1989). Engle and Ng (1991) suggested some particularly simple tests of the functional form of $h_t$ related to Lagrange multiplier tests, from which they concluded that Nelson's EGARCH specification or Glosten, Jagannathan, and Runkle's modification of GARCH described in [21.2.10] best describes the asymmetry in the conditional volatility of Japanese stock returns.

Engle and Mustafa (1992) proposed another approach to assessing the usefulness of a given specification of the conditional variance based on the observed prices for security options. These financial instruments give an investor the right to buy or sell the security at some date in the future at a price agreed upon today. The value of such an option increases with the perceived variability of the security. If the term for which the option applies is sufficiently short that stock prices can be approximated by Brownian motion with constant variance, a well-known formula developed by Black and Scholes (1973) relates the price of the option to investors' perception of the variance of the stock price. The observed option prices can then be used to construct the market's implicit perception of $h_t$, which can be compared with the specification implied by a given time series model. The results of such comparisons are quite favorable to simple GARCH and EGARCH specifications. Studies by Day and Lewis (1992) and Lamoureux and Lastrapes (1993) suggest that GARCH(1, 1) or EGARCH(1, 1) models can improve on the market's implicit assessment of $h_t$. Related evidence in support of the GARCH(1, 1) formulation was provided by Engle, Hong, Kane, and Noh (1991) and West, Edison, and Cho (1993).


APPENDIX 21.A. Derivation of Selected Equations for Chapter 21 


This appendix provides the details behind several of the assertions in the text. 
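■ Derivation of [21.1.21]. Observe that

$$\frac{\partial\log f(y_t|x_t, \mathcal{Y}_{t-1}; \theta)}{\partial\theta} = -\frac{1}{2}\frac{\partial\log h_t}{\partial\theta} - \frac{1}{2h_t}\frac{\partial(y_t - x_t'\beta)^2}{\partial\theta} + \frac{(y_t - x_t'\beta)^2}{2h_t^2}\frac{\partial h_t}{\partial\theta}.$$  [21.A.1]

But

$$\frac{\partial(y_t - x_t'\beta)^2}{\partial\theta} = \begin{bmatrix} -2x_t u_t \\ 0 \end{bmatrix}$$  [21.A.2]

and

$$\frac{\partial h_t}{\partial\theta} = \frac{\partial\left(\zeta + \sum_{j=1}^{m}\alpha_j u_{t-j}^2\right)}{\partial\theta} = \partial\zeta/\partial\theta + \sum_{j=1}^{m}(\partial\alpha_j/\partial\theta)\cdot u_{t-j}^2 + \sum_{j=1}^{m}\alpha_j\cdot(\partial u_{t-j}^2/\partial\theta)$$

$$= \begin{bmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ u_{t-1}^2 \\ \vdots \\ 0 \end{bmatrix} + \cdots + \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ u_{t-m}^2 \end{bmatrix} + \sum_{j=1}^{m}\alpha_j\begin{bmatrix} -2u_{t-j}x_{t-j} \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix} = \begin{bmatrix} \sum_{j=1}^{m} -2\alpha_j u_{t-j}x_{t-j} \\ z_t(\beta) \end{bmatrix}.$$  [21.A.3]

Substituting [21.A.2] and [21.A.3] into [21.A.1] produces

$$\frac{\partial\log f(y_t|x_t, \mathcal{Y}_{t-1}; \theta)}{\partial\theta} = -\left\{\frac{1}{2h_t} - \frac{u_t^2}{2h_t^2}\right\}\begin{bmatrix} \sum_{j=1}^{m} -2\alpha_j u_{t-j}x_{t-j} \\ z_t(\beta) \end{bmatrix} + \begin{bmatrix} (x_t u_t)/h_t \\ 0 \end{bmatrix},$$

as claimed. ■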
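■ Derivation of [21.1.25]. Expression [21.A.1] can be written

$$s_t(\theta) = \frac{1}{2}\left\{\frac{u_t^2}{h_t} - 1\right\}\frac{\partial\log h_t}{\partial\theta} - \frac{1}{2h_t}\frac{\partial u_t^2}{\partial\theta},$$

from which

$$\frac{\partial s_t(\theta)}{\partial\theta'} = \frac{1}{2}\frac{\partial\log h_t}{\partial\theta}\left\{\frac{1}{h_t}\frac{\partial u_t^2}{\partial\theta'} - \frac{u_t^2}{h_t^2}\frac{\partial h_t}{\partial\theta'}\right\} + \frac{1}{2}\left\{\frac{u_t^2}{h_t} - 1\right\}\frac{\partial^2\log h_t}{\partial\theta\,\partial\theta'} - \frac{1}{2h_t}\frac{\partial^2 u_t^2}{\partial\theta\,\partial\theta'} + \frac{\partial u_t^2}{\partial\theta}\frac{1}{2h_t^2}\frac{\partial h_t}{\partial\theta'}.$$  [21.A.4]

From expression [21.A.2],

$$\frac{\partial^2 u_t^2}{\partial\theta\,\partial\theta'} = \begin{bmatrix} -2x_t \\ 0 \end{bmatrix}\frac{\partial u_t}{\partial\theta'} = \begin{bmatrix} 2x_t x_t' & 0 \\ 0 & 0 \end{bmatrix}.$$

Substituting this and [21.A.2] into [21.A.4] results in

$$\frac{\partial s_t(\theta)}{\partial\theta'} = \frac{1}{2}\frac{\partial\log h_t}{\partial\theta}\left\{\frac{1}{h_t}\begin{bmatrix} -2u_t x_t' & 0' \end{bmatrix} - \frac{u_t^2}{h_t^2}\frac{\partial h_t}{\partial\theta'}\right\} + \frac{1}{2}\left\{\frac{u_t^2}{h_t} - 1\right\}\frac{\partial^2\log h_t}{\partial\theta\,\partial\theta'} - \frac{1}{2h_t}\begin{bmatrix} 2x_t x_t' & 0 \\ 0 & 0 \end{bmatrix} + \begin{bmatrix} -2x_t u_t \\ 0 \end{bmatrix}\frac{1}{2h_t^2}\frac{\partial h_t}{\partial\theta'}.$$  [21.A.5]

Recall that conditional on $x_t$ and on $\mathcal{Y}_{t-1}$, the magnitudes $h_t$ and $x_t$ are nonstochastic and

$$E(u_t|x_t, \mathcal{Y}_{t-1}) = 0$$
$$E(u_t^2|x_t, \mathcal{Y}_{t-1}) = h_t.$$

Thus, taking expectations of [21.A.5] conditional on $x_t$ and $\mathcal{Y}_{t-1}$ results in

$$E\left\{\frac{\partial s_t(\theta)}{\partial\theta'}\,\Big|\,x_t, \mathcal{Y}_{t-1}\right\} = -\frac{1}{2}\frac{\partial\log h_t}{\partial\theta}\frac{\partial\log h_t}{\partial\theta'} - \frac{1}{h_t}\begin{bmatrix} x_t x_t' & 0 \\ 0 & 0 \end{bmatrix}$$

$$= -\frac{1}{2h_t^2}\begin{bmatrix} \sum_{j=1}^{m} -2\alpha_j u_{t-j}x_{t-j} \\ z_t(\beta) \end{bmatrix}\begin{bmatrix} \sum_{j=1}^{m} -2\alpha_j u_{t-j}x_{t-j}' & [z_t(\beta)]' \end{bmatrix} - \frac{1}{h_t}\begin{bmatrix} x_t x_t' & 0 \\ 0 & 0 \end{bmatrix},$$

where the last equality follows from [21.A.3]. ■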


Chapter 21 References 


Baillie, Richard T., and Tim Bollerslev. 1989. "The Message in Daily Exchange Rates: A Conditional Variance Tale." Journal of Business and Economic Statistics 7:297-305.

Baillie, Richard T., and Tim Bollerslev. 1992. "Prediction in Dynamic Models with Time-Dependent Conditional Variances." Journal of Econometrics 52:91-113.

Bates, Charles, and Halbert White. 1988. "Efficient Instrumental Variables Estimation of Systems of Implicit Heterogeneous Nonlinear Dynamic Equations with Nonspherical Errors," in William A. Barnett, Ernst R. Berndt, and Halbert White, eds., Dynamic Econometric Modeling. Cambridge, England: Cambridge University Press.

Bera, Anil K., and X. Zuo. 1991. "Specification Test for a Linear Regression Model with ARCH Process." University of Illinois at Champaign-Urbana. Mimeo.

Berndt, E. K., B. H. Hall, R. E. Hall, and J. A. Hausman. 1974. "Estimation and Inference in Nonlinear Structural Models." Annals of Economic and Social Measurement 3:653-65.

Black, Fischer, and Myron Scholes. 1973. "The Pricing of Options and Corporate Liabilities." Journal of Political Economy 81:637-54.

Bollerslev, Tim. 1986. "Generalized Autoregressive Conditional Heteroskedasticity." Journal of Econometrics 31:307-27.

Bollerslev, Tim. 1987. "A Conditionally Heteroskedastic Time Series Model for Speculative Prices and Rates of Return." Review of Economics and Statistics 69:542-47.

Bollerslev, Tim. 1988. "On the Correlation Structure for the Generalized Autoregressive Conditional Heteroskedastic Process." Journal of Time Series Analysis 9:121-31.

Bollerslev, Tim. 1990. "Modelling the Coherence in Short-Run Nominal Exchange Rates: A Multivariate Generalized ARCH Model." Review of Economics and Statistics 72:498-505.

Bollerslev, Tim, Ray Y. Chou, and Kenneth F. Kroner. 1992. "ARCH Modeling in Finance: A Review of the Theory and Empirical Evidence." Journal of Econometrics 52:5-59.

Bollerslev, Tim, Robert F. Engle, and Jeffrey M. Wooldridge. 1988. "A Capital Asset Pricing Model with Time Varying Covariances." Journal of Political Economy 96:116-31.

Bollerslev, Tim, and Jeffrey M. Wooldridge. 1992. "Quasi-Maximum Likelihood Estimation and Inference in Dynamic Models with Time Varying Covariances." Econometric Reviews 11:143-72.

Cai, Jun. Forthcoming. "A Markov Model of Unconditional Variance in ARCH." Journal of Business and Economic Statistics.

Day, Theodore E., and Craig M. Lewis. 1992. "Stock Market Volatility and the Information Content of Stock Index Options." Journal of Econometrics 52:267-87.

DeGroot, Morris H. 1970. Optimal Statistical Decisions. New York: McGraw-Hill.

Diebold, Francis X., and Mark Nerlove. 1989. "The Dynamics of Exchange Rate Volatility: A Multivariate Latent Factor ARCH Model." Journal of Applied Econometrics 4:1-21.

Engle, Robert F. 1982. "Autoregressive Conditional Heteroscedasticity with Estimates of the Variance of United Kingdom Inflation." Econometrica 50:987-1007.

Engle, Robert F., and Tim Bollerslev. 1986. "Modelling the Persistence of Conditional Variances." Econometric Reviews 5:1-50.




Engle, Robert F., and Gloria Gonzalez-Rivera. 1991. "Semiparametric ARCH Models." Journal of Business and Economic Statistics 9:345-59.

Engle, Robert F., Ted Hong, Alex Kane, and Jaesun Noh. 1991. "Arbitration Valuation of Variance Forecasts Using Simulated Options Markets." Advances in Futures and Options Research, forthcoming.

Engle, Robert F., and Kenneth F. Kroner. 1993. "Multivariate Simultaneous Generalized ARCH." UCSD. Mimeo.

Engle, Robert F., David M. Lilien, and Russell P. Robins. 1987. "Estimating Time Varying Risk Premia in the Term Structure: The ARCH-M Model." Econometrica 55:391-407.

Engle, Robert F., and Chowdhury Mustafa. 1992. "Implied ARCH Models from Options Prices." Journal of Econometrics 52:289-311.

Engle, Robert F., and Victor K. Ng. 1991. "Measuring and Testing the Impact of News on Volatility." University of California, San Diego. Mimeo.

Engle, Robert F., Victor K. Ng, and Michael Rothschild. 1990. "Asset Pricing with a FACTOR-ARCH Covariance Structure: Empirical Estimates for Treasury Bills." Journal of Econometrics 45:213-37.

Ferson, Wayne E. 1989. "Changes in Expected Security Returns, Risk, and the Level of Interest Rates." Journal of Finance 44:1191-1218.

Gallant, A. Ronald, David A. Hsieh, and George Tauchen. 1989. "On Fitting a Recalcitrant Series: The Pound/Dollar Exchange Rate 1974-83." Duke University. Mimeo.

Gallant, A. Ronald, and George Tauchen. 1989. "Semi Non-Parametric Estimation of Conditionally Constrained Heterogeneous Processes: Asset Pricing Applications." Econometrica 57:1091-1120.

Geweke, John. 1986. "Modeling the Persistence of Conditional Variances: A Comment." Econometric Reviews 5:57-61.

Glosten, Lawrence R., Ravi Jagannathan, and David Runkle. 1989. "Relationship between the Expected Value and the Volatility of the Nominal Excess Return on Stocks." Northwestern University. Mimeo.

Gourieroux, Christian, and Alain Monfort. 1992. "Qualitative Threshold ARCH Models." Journal of Econometrics 52:159-99.

Hamilton, James D., and Raul Susmel. Forthcoming. "Autoregressive Conditional Heteroskedasticity and Changes in Regime." Journal of Econometrics.

Higgins, M. L., and A. K. Bera. 1992. "A Class of Nonlinear ARCH Models." International Economic Review 33:137-58.

Hsieh, David A. 1989. "Modeling Heteroscedasticity in Daily Foreign-Exchange Rates." Journal of Business and Economic Statistics 7:307-17.

Jorion, Philippe. 1988. "On Jump Processes in the Foreign Exchange and Stock Markets." Review of Financial Studies 1:427-45.

Lamoureux, Christopher G., and William D. Lastrapes. 1993. "Forecasting Stock Return Variance: Toward an Understanding of Stochastic Implied Volatilities." Review of Financial Studies 5:293-326.

Mark, Nelson. 1988. "Time Varying Betas and Risk Premia in the Pricing of Forward Foreign Exchange Contracts." Journal of Financial Economics 22:335-54.

Milhøj, Anders. 1985. "The Moment Structure of ARCH Processes." Scandinavian Journal of Statistics 12:281-92.

Milhøj, Anders. 1987. "A Multiplicative Parameterization of ARCH Models." Department of Statistics, University of Copenhagen. Mimeo.

Nelson, Daniel B. 1990. "Stationarity and Persistence in the GARCH(1, 1) Model." Econometric Theory 6:318-34.

Nelson, Daniel B. 1991. "Conditional Heteroskedasticity in Asset Returns: A New Approach." Econometrica 59:347-70.

Nelson, Daniel B., and Charles Q. Cao. 1992. "Inequality Constraints in the Univariate GARCH Model." Journal of Business and Economic Statistics 10:229-35.

Pagan, Adrian R., and Y. S. Hong. 1990. "Non-Parametric Estimation and the Risk Premium," in W. Barnett, J. Powell, and G. Tauchen, eds., Semiparametric and Nonparametric Methods in Econometrics and Statistics. Cambridge, England: Cambridge University Press.

Pagan, Adrian R., and G. William Schwert. 1990. "Alternative Models for Conditional Stock Volatility." Journal of Econometrics 45:267-90.




Pagan, Adrian R., and Aman Ullah. 1988. "The Econometric Analysis of Models with Risk Terms." Journal of Applied Econometrics 3:87-105.

Pantula, Sastry G. 1986. "Modeling the Persistence of Conditional Variances: A Comment." Econometric Reviews 5:71-74.

Rich, Robert W., Jennie Raymond, and J. S. Butler. 1991. "Generalized Instrumental Variables Estimation of Autoregressive Conditional Heteroskedastic Models." Economics Letters 35:179-85.

Simon, David P. 1989. "Expectations and Risk in the Treasury Bill Market: An Instrumental Variables Approach." Journal of Financial and Quantitative Analysis 24:357-66.

Weiss, Andrew A. 1984. "ARMA Models with ARCH Errors." Journal of Time Series Analysis 5:129-43.

Weiss, Andrew A. 1986. "Asymptotic Theory for ARCH Models: Estimation and Testing." Econometric Theory 2:107-31.

West, Kenneth D., Hali J. Edison, and Dongchul Cho. 1993. "A Utility Based Comparison of Some Models of Foreign Exchange Volatility." Journal of International Economics, forthcoming.




22 


Modeling Time Series 
with Changes in Regime 


22.1. Introduction 


Many variables undergo episodes in which the behavior of the series seems to 
change quite dramatically. A striking example is provided by Figure 22.1, which 
is taken from Rogers’s (1992) study of the volume of dollar-denominated accounts 
held in Mexican banks. The Mexican government adopted various measures in 
1982 to try to discourage the use of such accounts, and the effects are quite dramatic 
in a plot of the series. 

Similar dramatic breaks will be seen if one follows almost any macroeconomic or financial time series for a sufficiently long period. Such apparent changes in the time series process can result from events such as wars, financial panics, or significant changes in government policies.

How should we model a change in the process followed by a particular time series? For the data plotted in Figure 22.1, one simple idea would be that the constant term for the autoregression changed in 1982. For data prior to 1982 we might use a model such as

$$y_t - \mu_1 = \phi(y_{t-1} - \mu_1) + \varepsilon_t,$$  [22.1.1]

while data after 1982 might be described by

$$y_t - \mu_2 = \phi(y_{t-1} - \mu_2) + \varepsilon_t,$$  [22.1.2]

where $\mu_2 < \mu_1$.

The specification in [22.1.1] and [22.1.2] seems a plausible description of the data in Figure 22.1, but it is not altogether satisfactory as a time series model. For example, how are we to forecast a series that is described by [22.1.1] and [22.1.2]? If the process has changed in the past, clearly it could also change again in the future, and this prospect should be taken into account in forming a forecast. Moreover, the change in regime surely should not be regarded as the outcome of a perfectly foreseeable, deterministic event. Rather, the change in regime is itself a random variable. A complete time series model would therefore include a description of the probability law governing the change from $\mu_1$ to $\mu_2$.

These observations suggest that we might consider the process to be influenced by an unobserved random variable $s_t^*$, which will be called the state or regime that the process was in at date t. If $s_t^* = 1$, then the process is in regime 1, while $s_t^* = 2$ means that the process is in regime 2. Equations [22.1.1] and [22.1.2] can then equivalently be written as

$$y_t - \mu_{s_t^*} = \phi(y_{t-1} - \mu_{s_{t-1}^*}) + \varepsilon_t,$$  [22.1.3]

where $\mu_{s_t^*}$ indicates $\mu_1$ when $s_t^* = 1$ and indicates $\mu_2$ when $s_t^* = 2$.


FIGURE 22.1 Log of the ratio of the peso value of dollar-denominated bank accounts in Mexico to the peso value of peso-denominated bank accounts in Mexico, monthly, 1978-85. (Rogers, 1992).


We then need a description of the time series process for the unobserved variable $s_t^*$. Since $s_t^*$ takes on only discrete values (in this case, $s_t^*$ is either 1 or 2), this will be a slightly different time series model from those for continuous-valued random variables considered elsewhere in this book.

The simplest time series model for a discrete-valued random variable is a Markov chain. The theory of Markov chains is reviewed in Section 22.2. In Section 22.4 this theory will be combined with a conventional time series model such as an autoregression that is assumed to characterize any given regime. Prior to doing so, however, it will be helpful to consider a special case of such processes, namely, that for which $\phi = 0$ in [22.1.3] and $s_t^*$ is an i.i.d. discrete-valued random variable. Such a specification describes $y_t$ as a simple mixture of different distributions, the statistical theory for which is reviewed in Section 22.3.


22.2. Markov Chains 


Let $s_t$ be a random variable that can assume only an integer value $\{1, 2, \ldots, N\}$. Suppose that the probability that $s_t$ equals some particular value j depends on the past only through the most recent value $s_{t-1}$:

$$P\{s_t = j|s_{t-1} = i, s_{t-2} = k, \ldots\} = P\{s_t = j|s_{t-1} = i\} = p_{ij}.$$  [22.2.1]

Such a process is described as an N-state Markov chain with transition probabilities $\{p_{ij}\}_{i,j=1,2,\ldots,N}$. The transition probability $p_{ij}$ gives the probability that state i will be followed by state j. Note that

$$p_{i1} + p_{i2} + \cdots + p_{iN} = 1.$$  [22.2.2]




It is often convenient to collect the transition probabilities in an $(N \times N)$ matrix $P$ known as the transition matrix:

$$P = \begin{bmatrix} p_{11} & p_{21} & \cdots & p_{N1} \\ p_{12} & p_{22} & \cdots & p_{N2} \\ \vdots & \vdots & \cdots & \vdots \\ p_{1N} & p_{2N} & \cdots & p_{NN} \end{bmatrix}.$$  [22.2.3]

The row j, column i element of $P$ is the transition probability $p_{ij}$; for example, the row 2, column 1 element gives the probability that state 1 will be followed by state 2.
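Because the convention in [22.2.3] places $p_{ij}$ in row j, column i, each column of the transition matrix (rather than each row) sums to unity. A small sketch makes the layout concrete; the function name `transition_matrix` and the probabilities are hypothetical, and Python indices start at zero, so state 1 corresponds to index 0.

```python
def transition_matrix(p):
    """Arrange p[i][j] = P{s_t = j | s_{t-1} = i} in the layout of [22.2.3]."""
    n = len(p)
    # Row j, column i of P holds p_ij, so P is the transpose of the table p.
    return [[p[i][j] for i in range(n)] for j in range(n)]


# Illustrative two-state chain: p_11 = 0.9, p_12 = 0.1, p_21 = 0.3, p_22 = 0.7.
p = [[0.9, 0.1],
     [0.3, 0.7]]
P = transition_matrix(p)
column_sums = [sum(P[j][i] for j in range(2)) for i in range(2)]
print(P, column_sums)
```

The row 2, column 1 element of the result is $p_{12} = 0.1$, the probability that state 1 is followed by state 2, and both columns sum to one as required by [22.2.2].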


Representing a Markov Chain with a Vector Autoregression 


A useful representation for a Markov chain is obtained by letting $\xi_t$ denote a random $(N \times 1)$ vector whose jth element is equal to unity if $s_t = j$ and whose jth element equals zero otherwise. Thus, when $s_t = 1$, the vector $\xi_t$ is equal to the first column of $I_N$ (the $N \times N$ identity matrix); when $s_t = 2$, the vector $\xi_t$ is the second column of $I_N$; and so on:

$$\xi_t = \begin{cases} (1, 0, 0, \ldots, 0)' & \text{when } s_t = 1 \\ (0, 1, 0, \ldots, 0)' & \text{when } s_t = 2 \\ \quad\vdots & \quad\vdots \\ (0, 0, 0, \ldots, 1)' & \text{when } s_t = N. \end{cases}$$


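If $s_t = i$, then the jth element of $\boldsymbol{\xi}_{t+1}$ is a random variable that takes on the
value unity with probability $p_{ij}$ and takes on the value zero otherwise. Such a
random variable has expectation $p_{ij}$. Thus, the conditional expectation of $\boldsymbol{\xi}_{t+1}$ given
$s_t = i$ is given by

$$
E(\boldsymbol{\xi}_{t+1} \mid s_t = i) = \begin{bmatrix} p_{i1} \\ p_{i2} \\ \vdots \\ p_{iN} \end{bmatrix}.
\tag{22.2.4}
$$

This vector is simply the ith column of the matrix $\mathbf{P}$ in [22.2.3]. Moreover, when
$s_t = i$, the vector $\boldsymbol{\xi}_t$ corresponds to the ith column of $\mathbf{I}_N$, in which case the vector
in [22.2.4] could be described as $\mathbf{P}\boldsymbol{\xi}_t$. Hence, expression [22.2.4] implies that

$$
E(\boldsymbol{\xi}_{t+1} \mid \boldsymbol{\xi}_t) = \mathbf{P}\boldsymbol{\xi}_t,
$$

and indeed, from the Markov property [22.2.1], it follows further that

$$
E(\boldsymbol{\xi}_{t+1} \mid \boldsymbol{\xi}_t, \boldsymbol{\xi}_{t-1}, \ldots) = \mathbf{P}\boldsymbol{\xi}_t.
\tag{22.2.5}
$$

Result [22.2.5] implies that it is possible to express a Markov chain in the
form

$$
\boldsymbol{\xi}_{t+1} = \mathbf{P}\boldsymbol{\xi}_t + \mathbf{v}_{t+1},
\tag{22.2.6}
$$

where

$$
\mathbf{v}_{t+1} \equiv \boldsymbol{\xi}_{t+1} - E(\boldsymbol{\xi}_{t+1} \mid \boldsymbol{\xi}_t, \boldsymbol{\xi}_{t-1}, \ldots).
\tag{22.2.7}
$$

Expression [22.2.6] has the form of a first-order vector autoregression for $\boldsymbol{\xi}_t$; note
that [22.2.7] implies that the innovation $\mathbf{v}_t$ is a martingale difference sequence.
Although the vector $\mathbf{v}_t$ can take on only a finite set of values, on average $\mathbf{v}_t$ is zero.
Moreover, the value of $\mathbf{v}_t$ is impossible to forecast on the basis of previous states
of the process.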
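The claim that the innovation averages to zero given the current state can be checked by direct enumeration. The following Python sketch is only an illustration: it assumes NumPy, and the three-state matrix `P` and the helper name `innovation_mean` are hypothetical choices that do not appear in the text.

```python
import numpy as np

# Hypothetical 3-state transition matrix (illustrative values only).
# Column i holds the probabilities of moving from state i to each state j,
# so every column sums to unity, as in [22.2.3].
P = np.array([[0.90, 0.10, 0.05],
              [0.07, 0.80, 0.15],
              [0.03, 0.10, 0.80]])
N = P.shape[0]
I = np.eye(N)


def innovation_mean(P, i):
    """Probability-weighted mean of v_{t+1} = xi_{t+1} - P xi_t given s_t = i.

    The state index i is 0-based here, whereas the text counts states from 1.
    """
    xi_t = I[:, i]              # xi_t is the ith column of the identity matrix
    forecast = P @ xi_t         # E(xi_{t+1} | xi_t), the ith column of P
    # xi_{t+1} equals column j of the identity matrix with probability P[j, i].
    return sum(P[j, i] * (I[:, j] - forecast) for j in range(N))
```

Calling `innovation_mean(P, i)` for each state returns a vector of zeros (up to rounding), which is the conditional-mean-zero property that makes $\mathbf{v}_t$ a martingale difference sequence.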




Forecasts for a Markov Chain 
Expression [22.2.6] implies that

$$
\boldsymbol{\xi}_{t+m} = \mathbf{v}_{t+m} + \mathbf{P}\mathbf{v}_{t+m-1} + \mathbf{P}^2\mathbf{v}_{t+m-2} + \cdots + \mathbf{P}^{m-1}\mathbf{v}_{t+1} + \mathbf{P}^m\boldsymbol{\xi}_t,
\tag{22.2.8}
$$

where $\mathbf{P}^m$ indicates the transition matrix multiplied by itself m times. It follows
from [22.2.8] that m-period-ahead forecasts for a Markov chain can be calculated
from

$$
E(\boldsymbol{\xi}_{t+m} \mid \boldsymbol{\xi}_t, \boldsymbol{\xi}_{t-1}, \ldots) = \mathbf{P}^m\boldsymbol{\xi}_t.
\tag{22.2.9}
$$

Again, since the jth element of $\boldsymbol{\xi}_{t+m}$ will be unity if $s_{t+m} = j$ and zero otherwise,
the jth element of the $(N \times 1)$ vector $E(\boldsymbol{\xi}_{t+m} \mid \boldsymbol{\xi}_t, \boldsymbol{\xi}_{t-1}, \ldots)$ indicates the probability
that $s_{t+m}$ takes on the value j, conditional on the state of the system at date t. For
example, if the process is in state i at date t, then [22.2.9] asserts that

$$
\begin{bmatrix}
P\{s_{t+m} = 1 \mid s_t = i\} \\
P\{s_{t+m} = 2 \mid s_t = i\} \\
\vdots \\
P\{s_{t+m} = N \mid s_t = i\}
\end{bmatrix}
= \mathbf{P}^m \mathbf{e}_i,
\tag{22.2.10}
$$

where $\mathbf{e}_i$ denotes the ith column of $\mathbf{I}_N$. Expression [22.2.10] indicates that the
m-period-ahead transition probabilities for a Markov chain can be calculated by
multiplying the matrix $\mathbf{P}$ by itself m times. Specifically, the probability that an
observation from regime i will be followed m periods later by an observation from regime
j, $P\{s_{t+m} = j \mid s_t = i\}$, is given by the row j, column i element of the matrix $\mathbf{P}^m$.
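As a small numerical illustration of [22.2.10], the sketch below computes m-period-ahead transition probabilities with NumPy. The two-state values ($p_{11} = 0.9$, $p_{22} = 0.7$) and the function name `m_step_forecast` are assumptions made for this example only.

```python
import numpy as np

# Hypothetical two-state chain with p11 = 0.9 and p22 = 0.7.
# Columns sum to unity, following the convention of [22.2.3].
P = np.array([[0.9, 0.3],
              [0.1, 0.7]])


def m_step_forecast(P, i, m):
    """Return P^m e_i, the distribution of s_{t+m} given s_t = i (i is 0-based)."""
    e_i = np.eye(P.shape[0])[:, i]
    return np.linalg.matrix_power(P, m) @ e_i
```

For instance, `m_step_forecast(P, 0, 2)` gives the two-period-ahead probabilities starting from the first state; the elements of any such forecast sum to unity because every column of $\mathbf{P}^m$ does.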


Reducible Markov Chains 


For a two-state Markov chain, the transition matrix is

$$
\mathbf{P} = \begin{bmatrix} p_{11} & 1 - p_{22} \\ 1 - p_{11} & p_{22} \end{bmatrix}.
\tag{22.2.11}
$$

Suppose that $p_{11} = 1$, so that the matrix $\mathbf{P}$ is upper triangular. Then, once the
process enters state 1, there is no possibility of ever returning to state 2. In such
a case we would say that state 1 is an absorbing state and that the Markov chain
is reducible.

More generally, an N-state Markov chain is said to be reducible if there exists
a way to label the states (that is, a way to choose which state to call state 1, which
to call state 2, and so on) such that the transition matrix can be written in the form

$$
\mathbf{P} = \begin{bmatrix} \mathbf{B} & \mathbf{C} \\ \mathbf{0} & \mathbf{D} \end{bmatrix},
$$

where $\mathbf{B}$ denotes a $(K \times K)$ matrix for some $1 \le K < N$. If $\mathbf{P}$ is upper
block-triangular, then so is $\mathbf{P}^m$ for any m. Hence, once such a process enters a state j
such that $j \le K$, there is no possibility of ever returning to one of the states
$K + 1, K + 2, \ldots, N$.

A Markov chain that is not reducible is said to be irreducible. For example,
a two-state chain is irreducible if $p_{11} < 1$ and $p_{22} < 1$.
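Searching over relabelings of the states is awkward in practice. The sketch below swaps in a different but standard test from the theory of nonnegative matrices: an $(N \times N)$ nonnegative matrix is irreducible exactly when every element of $(\mathbf{I}_N + \mathbf{A})^{N-1}$ is positive, where $\mathbf{A}$ records which one-step transitions are possible. This is not the text's definition restated, only an equivalent check; the function name `is_irreducible` is hypothetical.

```python
import numpy as np


def is_irreducible(P):
    """Check irreducibility of a transition matrix via (I + A)^(N-1) > 0.

    A[j, i] = 1 when a one-step move from state i to state j is possible.
    Integer arithmetic is adequate for the small N used in illustrations.
    """
    N = P.shape[0]
    A = (P > 0).astype(int)
    reach = np.linalg.matrix_power(np.eye(N, dtype=int) + A, N - 1)
    return bool((reach > 0).all())
```

With $p_{11} = 1$ the two-state chain fails the test, since state 2 cannot be reached from state 1, while any two-state chain with $p_{11} < 1$ and $p_{22} < 1$ passes it.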




Ergodic Markov Chains


Equation [22.2.2] requires that every column of $\mathbf{P}$ sum to unity, or

$$
\mathbf{P}'\mathbf{1} = \mathbf{1},
\tag{22.2.12}
$$

where $\mathbf{1}$ denotes an $(N \times 1)$ vector of 1s. Expression [22.2.12] implies that unity
is an eigenvalue of the matrix $\mathbf{P}'$ and that $\mathbf{1}$ is the associated eigenvector. Since a
matrix and its transpose share the same eigenvalues, it follows that unity is an
eigenvalue of the transition matrix $\mathbf{P}$ for any Markov chain.

Consider an N-state irreducible Markov chain with transition matrix $\mathbf{P}$.
Suppose that one of the eigenvalues of $\mathbf{P}$ is unity and that all other eigenvalues of
$\mathbf{P}$ are inside the unit circle. Then the Markov chain is said to be ergodic. The
$(N \times 1)$ vector of ergodic probabilities for an ergodic chain is denoted $\boldsymbol{\pi}$. This
vector $\boldsymbol{\pi}$ is defined as the eigenvector of $\mathbf{P}$ associated with the unit eigenvalue;
that is, the vector of ergodic probabilities $\boldsymbol{\pi}$ satisfies

$$
\mathbf{P}\boldsymbol{\pi} = \boldsymbol{\pi}.
\tag{22.2.13}
$$

The eigenvector $\boldsymbol{\pi}$ is normalized so that its elements sum to unity ($\mathbf{1}'\boldsymbol{\pi} = 1$). It
can be shown that if $\mathbf{P}$ is the transition matrix for an ergodic Markov chain, then

$$
\lim_{m \to \infty} \mathbf{P}^m = \boldsymbol{\pi} \cdot \mathbf{1}'.
\tag{22.2.14}
$$
We establish [22.2.14] here for the case when all the eigenvalues of $\mathbf{P}$ are
distinct; a related argument based on the Jordan decomposition that is valid for
ergodic chains with repeated eigenvalues is developed in Cox and Miller (1965,
pp. 120-23). For the case of distinct eigenvalues, we know from [A.4.24] that $\mathbf{P}$
can always be written in the form

$$
\mathbf{P} = \mathbf{T}\boldsymbol{\Lambda}\mathbf{T}^{-1},
\tag{22.2.15}
$$

where $\mathbf{T}$ is an $(N \times N)$ matrix whose columns are the eigenvectors of $\mathbf{P}$ while $\boldsymbol{\Lambda}$
is a diagonal matrix whose diagonal contains the corresponding eigenvalues of $\mathbf{P}$.
It follows as in [1.2.19] that

$$
\mathbf{P}^m = \mathbf{T}\boldsymbol{\Lambda}^m\mathbf{T}^{-1}.
\tag{22.2.16}
$$

Since the (1, 1) element of $\boldsymbol{\Lambda}$ is unity and all other elements of $\boldsymbol{\Lambda}$ are inside the
unit circle, $\boldsymbol{\Lambda}^m$ converges to a matrix with unity in the (1, 1) position and zeros
elsewhere. Hence,

$$
\lim_{m \to \infty} \mathbf{P}^m = \mathbf{x} \cdot \mathbf{y}',
\tag{22.2.17}
$$

where $\mathbf{x}$ is the first column of $\mathbf{T}$ and $\mathbf{y}'$ is the first row of $\mathbf{T}^{-1}$.
The first column of $\mathbf{T}$ is the eigenvector of $\mathbf{P}$ corresponding to the unit
eigenvalue, which eigenvector was denoted $\boldsymbol{\pi}$ in [22.2.13]:

$$
\mathbf{x} = \boldsymbol{\pi}.
\tag{22.2.18}
$$

Moreover, the first row of $\mathbf{T}^{-1}$, when expressed as a column vector, corresponds
to the eigenvector of $\mathbf{P}'$ associated with the unit eigenvalue, which eigenvector was
seen to be proportional to the vector $\mathbf{1}$ in [22.2.12]:

$$
\mathbf{y} = \alpha \cdot \mathbf{1}.
\tag{22.2.19}
$$

To verify [22.2.19], note from [22.2.15] that the matrix of eigenvectors $\mathbf{T}$ of the
matrix $\mathbf{P}$ is characterized by

$$
\mathbf{P}\mathbf{T} = \mathbf{T}\boldsymbol{\Lambda}.
\tag{22.2.20}
$$




Transposing [22.2.15] results in

$$
\mathbf{P}' = (\mathbf{T}^{-1})'\boldsymbol{\Lambda}\mathbf{T}',
$$

and postmultiplying by $(\mathbf{T}^{-1})'$ yields

$$
\mathbf{P}'(\mathbf{T}^{-1})' = (\mathbf{T}^{-1})'\boldsymbol{\Lambda}.
\tag{22.2.21}
$$

Comparing [22.2.21] with [22.2.20] confirms that the columns of $(\mathbf{T}^{-1})'$ correspond
to eigenvectors of $\mathbf{P}'$. In particular, then, the first column of $(\mathbf{T}^{-1})'$ is proportional
to the eigenvector of $\mathbf{P}'$ associated with the unit eigenvalue, which eigenvector was
seen to be given by $\mathbf{1}$ in equation [22.2.12]. Since $\mathbf{y}$ was defined as the first column
of $(\mathbf{T}^{-1})'$, this establishes the claim made in equation [22.2.19].

Substituting [22.2.18] and [22.2.19] into [22.2.17], it follows that

$$
\lim_{m \to \infty} \mathbf{P}^m = \boldsymbol{\pi} \cdot \alpha \mathbf{1}'.
$$

Since $\mathbf{P}^m$ can be interpreted as a matrix of transition probabilities, each column
must sum to unity. Thus, since the vector of ergodic probabilities $\boldsymbol{\pi}$ was normalized
by the condition that $\mathbf{1}'\boldsymbol{\pi} = 1$, it follows that the normalizing constant $\alpha$ must be
unity, establishing the claim made in [22.2.14].
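Result [22.2.14] implies that the long-run forecast for an ergodic Markov
chain is independent of the current state, since, from [22.2.9],

$$
E(\boldsymbol{\xi}_{t+m} \mid \boldsymbol{\xi}_t, \boldsymbol{\xi}_{t-1}, \ldots) = \mathbf{P}^m\boldsymbol{\xi}_t \;\xrightarrow[m \to \infty]{}\; \boldsymbol{\pi} \cdot \mathbf{1}'\boldsymbol{\xi}_t = \boldsymbol{\pi},
$$

where the final equality follows from the observation that $\mathbf{1}'\boldsymbol{\xi}_t = 1$ regardless of
the value of $\boldsymbol{\xi}_t$. The long-run forecast of $\boldsymbol{\xi}_{t+m}$ is given by the vector of ergodic
probabilities $\boldsymbol{\pi}$ regardless of the current value of $\boldsymbol{\xi}_t$.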

The vector of ergodic probabilities can also be viewed as indicating the
unconditional probability of each of the N different states. To see this, suppose that
we had used the symbol $\pi_j$ to indicate the unconditional probability $P\{s_t = j\}$.
Then the vector $\boldsymbol{\pi} \equiv (\pi_1, \pi_2, \ldots, \pi_N)'$ could be described as the unconditional
expectation of $\boldsymbol{\xi}_t$:

$$
\boldsymbol{\pi} = E(\boldsymbol{\xi}_t).
\tag{22.2.22}
$$

If one takes unconditional expectations of [22.2.6], the result is

$$
E(\boldsymbol{\xi}_{t+1}) = \mathbf{P} \cdot E(\boldsymbol{\xi}_t).
$$

Assuming stationarity and using the definition [22.2.22], this becomes

$$
\boldsymbol{\pi} = \mathbf{P} \cdot \boldsymbol{\pi},
$$

which is identical to equation [22.2.13] characterizing $\boldsymbol{\pi}$ as the eigenvector of $\mathbf{P}$
associated with the unit eigenvalue. For an ergodic Markov chain, this eigenvector
is unique, and so the vector of ergodic probabilities $\boldsymbol{\pi}$ can be interpreted as the
vector of unconditional probabilities.
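The eigenvector characterization [22.2.13] and the limit [22.2.14] can be illustrated numerically. The following is a minimal sketch assuming NumPy; the three-state matrix is hypothetical (all of its elements are positive, which is sufficient for the chain to be ergodic), and the function name `ergodic_probs_eig` is introduced only for this example.

```python
import numpy as np


def ergodic_probs_eig(P):
    """Ergodic probabilities as the eigenvector of P for the unit eigenvalue."""
    eigvals, eigvecs = np.linalg.eig(P)
    k = np.argmin(np.abs(eigvals - 1.0))    # position of the unit eigenvalue
    pi = np.real(eigvecs[:, k])
    return pi / pi.sum()                    # normalize so that 1'pi = 1


# Hypothetical 3-state transition matrix; every column sums to unity.
P = np.array([[0.90, 0.10, 0.05],
              [0.07, 0.80, 0.15],
              [0.03, 0.10, 0.80]])

pi = ergodic_probs_eig(P)
# For large m, every column of P^m should be close to pi, as in [22.2.14].
P_limit = np.linalg.matrix_power(P, 500)
```

Each column of `P_limit` agrees with `pi` to many decimal places, so the long-run forecast is the same whichever state the chain starts from.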

An ergodic Markov chain is a covariance-stationary process. Yet [22.2.6] takes
the form of a VAR with a unit root, since one of the eigenvalues of $\mathbf{P}$ is unity.
This VAR is stationary despite the unit root because the variance-covariance matrix
of $\mathbf{v}_t$ is singular. In particular, since $\mathbf{1}'\boldsymbol{\xi}_t = 1$ for all t and since $\mathbf{1}'\mathbf{P} = \mathbf{1}'$, equation
[22.2.6] implies that $\mathbf{1}'\mathbf{v}_t = 0$ for all t. Thus, from [22.2.19], the first element of
the $(N \times 1)$ vector $\mathbf{T}^{-1}\mathbf{v}_t$ is always zero, meaning that from [22.2.16] the unit
eigenvalue in $\mathbf{P}^m\mathbf{v}_t$ always has a coefficient of zero.




Further Discussion of Two-State Markov Chains 


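The eigenvalues of the transition matrix $\mathbf{P}$ for any N-state Markov chain are
found from the solutions to $|\mathbf{P} - \lambda\mathbf{I}_N| = 0$. For the two-state Markov chain, the
eigenvalues satisfy

$$
\begin{aligned}
0 &= \begin{vmatrix} p_{11} - \lambda & 1 - p_{22} \\ 1 - p_{11} & p_{22} - \lambda \end{vmatrix} \\
&= (p_{11} - \lambda)(p_{22} - \lambda) - (1 - p_{11})(1 - p_{22}) \\
&= p_{11}p_{22} - (p_{11} + p_{22})\lambda + \lambda^2 - 1 + p_{11} + p_{22} - p_{11}p_{22} \\
&= \lambda^2 - (p_{11} + p_{22})\lambda - 1 + p_{11} + p_{22} \\
&= (\lambda - 1)(\lambda + 1 - p_{11} - p_{22}).
\end{aligned}
$$

Thus, the eigenvalues for a two-state chain are given by $\lambda_1 = 1$ and $\lambda_2 = -1 + p_{11} + p_{22}$.
The second eigenvalue, $\lambda_2$, will be inside the unit circle as long as
$0 < p_{11} + p_{22} < 2$. We saw earlier that this chain is irreducible as long as $p_{11} < 1$
and $p_{22} < 1$. Thus, a two-state Markov chain is ergodic provided that $p_{11} < 1$,
$p_{22} < 1$, and $p_{11} + p_{22} > 0$.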
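The eigenvector associated with $\lambda_1$ for the two-state chain turns out to be

$$
\boldsymbol{\pi} = \begin{bmatrix} (1 - p_{22})/(2 - p_{11} - p_{22}) \\ (1 - p_{11})/(2 - p_{11} - p_{22}) \end{bmatrix}
$$

(the reader is invited to confirm this and the claims that follow in Exercise 22.1).
Thus, the unconditional probability that the process will be in regime 1 at any given
date is given by

$$
P\{s_t = 1\} = \frac{1 - p_{22}}{2 - p_{11} - p_{22}}.
$$

The unconditional probability that the process will be in regime 2, the second
element of $\boldsymbol{\pi}$, is readily seen to be 1 minus this magnitude. The eigenvector
associated with $\lambda_2$ is

$$
\begin{bmatrix} -1 \\ 1 \end{bmatrix}.
$$

Thus, from [22.2.16], the matrix of m-period-ahead transition probabilities for an
ergodic two-state Markov chain is given by

$$
\begin{aligned}
\mathbf{P}^m &=
\begin{bmatrix}
\dfrac{1 - p_{22}}{2 - p_{11} - p_{22}} & -1 \\[2ex]
\dfrac{1 - p_{11}}{2 - p_{11} - p_{22}} & 1
\end{bmatrix}
\begin{bmatrix} 1 & 0 \\ 0 & \lambda_2^m \end{bmatrix}
\begin{bmatrix}
1 & 1 \\[1ex]
\dfrac{-(1 - p_{11})}{2 - p_{11} - p_{22}} & \dfrac{1 - p_{22}}{2 - p_{11} - p_{22}}
\end{bmatrix} \\[2ex]
&=
\begin{bmatrix}
\dfrac{(1 - p_{22}) + \lambda_2^m(1 - p_{11})}{2 - p_{11} - p_{22}} &
\dfrac{(1 - p_{22}) - \lambda_2^m(1 - p_{22})}{2 - p_{11} - p_{22}} \\[2ex]
\dfrac{(1 - p_{11}) - \lambda_2^m(1 - p_{11})}{2 - p_{11} - p_{22}} &
\dfrac{(1 - p_{11}) + \lambda_2^m(1 - p_{22})}{2 - p_{11} - p_{22}}
\end{bmatrix}.
\end{aligned}
$$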



Thus, for example, if the process is currently in state 1, the probability that m
periods later it will be in state 2 is given by

$$
P\{s_{t+m} = 2 \mid s_t = 1\} = \frac{(1 - p_{11}) - \lambda_2^m(1 - p_{11})}{2 - p_{11} - p_{22}},
$$

where $\lambda_2 = -1 + p_{11} + p_{22}$.
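The closed-form expression for the m-period-ahead transition probabilities can be compared with brute-force matrix multiplication. The sketch below assumes NumPy; the function name `two_state_power` and the numerical values used to exercise it are illustrative assumptions.

```python
import numpy as np


def two_state_power(p11, p22, m):
    """Closed-form P^m for an ergodic two-state chain."""
    lam2 = -1.0 + p11 + p22                 # second eigenvalue of P
    d = 2.0 - p11 - p22
    pi1 = (1.0 - p22) / d                   # ergodic probability of regime 1
    pi2 = (1.0 - p11) / d                   # ergodic probability of regime 2
    return np.array([[pi1 + lam2**m * pi2, pi1 - lam2**m * pi1],
                     [pi2 - lam2**m * pi2, pi2 + lam2**m * pi1]])
```

With $p_{11} = 0.9$ and $p_{22} = 0.7$, `two_state_power(0.9, 0.7, m)` agrees with `np.linalg.matrix_power` applied to the matrix in [22.2.11]; the row 2, column 1 element is the probability of being in state 2 after m periods when starting from state 1.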

A two-state Markov chain can also be represented by a simple scalar AR(1)
process, as follows. Let $\xi_{1t}$ denote the first element of the vector $\boldsymbol{\xi}_t$; that is, $\xi_{1t}$ is
a random variable that is equal to unity when $s_t = 1$ and equal to zero otherwise.
For the two-state chain, the second element of $\boldsymbol{\xi}_t$ is then $1 - \xi_{1t}$. Hence, [22.2.6]
can be written

$$
\begin{bmatrix} \xi_{1,t+1} \\ 1 - \xi_{1,t+1} \end{bmatrix}
= \begin{bmatrix} p_{11} & 1 - p_{22} \\ 1 - p_{11} & p_{22} \end{bmatrix}
\begin{bmatrix} \xi_{1t} \\ 1 - \xi_{1t} \end{bmatrix}
+ \begin{bmatrix} v_{1,t+1} \\ v_{2,t+1} \end{bmatrix}.
\tag{22.2.23}
$$

The first row of [22.2.23] states that

$$
\xi_{1,t+1} = (1 - p_{22}) + (-1 + p_{11} + p_{22})\xi_{1t} + v_{1,t+1}.
\tag{22.2.24}
$$

Expression [22.2.24] will be recognized as an AR(1) process with constant term
$(1 - p_{22})$ and autoregressive coefficient equal to $(-1 + p_{11} + p_{22})$. Note that this
autoregressive coefficient turns out to be the second eigenvalue $\lambda_2$ of $\mathbf{P}$ calculated
previously. When $p_{11} + p_{22} > 1$, the process is likely to persist in its current state
and the variable $\xi_{1t}$ would be positively serially correlated, whereas when
$p_{11} + p_{22} < 1$, the process is more likely to switch out of a state than stay in it, producing
negative serial correlation. Recall further from equation [3.4.3] that the mean of
a first-order autoregression is given by $c/(1 - \phi)$. Hence, the representation [22.2.24]
implies that

$$
E(\xi_{1t}) = \frac{1 - p_{22}}{1 - (-1 + p_{11} + p_{22})} = \frac{1 - p_{22}}{2 - p_{11} - p_{22}},
$$

which reproduces the earlier calculation of the value for the ergodic probability $\pi_1$.


Calculating Ergodic Probabilities 
for an N-state Markov Chain 


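For a general ergodic N-state process, the vector of unconditional probabilities
represents a vector $\boldsymbol{\pi}$ with the properties that $\mathbf{P}\boldsymbol{\pi} = \boldsymbol{\pi}$ and $\mathbf{1}'\boldsymbol{\pi} = 1$, where $\mathbf{1}$
denotes an $(N \times 1)$ vector of 1s. We thus seek a vector $\boldsymbol{\pi}$ satisfying

$$
\mathbf{A}\boldsymbol{\pi} = \mathbf{e}_{N+1},
\tag{22.2.25}
$$

where $\mathbf{e}_{N+1}$ denotes the (N + 1)th column of $\mathbf{I}_{N+1}$ and where

$$
\underset{(N+1) \times N}{\mathbf{A}} = \begin{bmatrix} \mathbf{I}_N - \mathbf{P} \\ \mathbf{1}' \end{bmatrix}.
$$

Such a solution can be found by premultiplying [22.2.25] by $(\mathbf{A}'\mathbf{A})^{-1}\mathbf{A}'$:

$$
\boldsymbol{\pi} = (\mathbf{A}'\mathbf{A})^{-1}\mathbf{A}'\mathbf{e}_{N+1}.
\tag{22.2.26}
$$

In other words, $\boldsymbol{\pi}$ is the (N + 1)th column of the matrix $(\mathbf{A}'\mathbf{A})^{-1}\mathbf{A}'$.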
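Formula [22.2.26] translates almost line for line into code. The sketch below assumes NumPy and uses a linear solve rather than an explicit matrix inverse, which is a numerical convenience and not something the text prescribes; the function name `ergodic_probs` is hypothetical.

```python
import numpy as np


def ergodic_probs(P):
    """Solve [22.2.26]: pi = (A'A)^{-1} A' e_{N+1}, with A = [I_N - P; 1']."""
    N = P.shape[0]
    A = np.vstack([np.eye(N) - P, np.ones((1, N))])   # (N+1) x N matrix
    e = np.zeros(N + 1)
    e[-1] = 1.0                                       # (N+1)th column of I_{N+1}
    # Solving the normal equations avoids forming the inverse explicitly.
    return np.linalg.solve(A.T @ A, A.T @ e)
```

For the two-state chain with $p_{11} = 0.9$ and $p_{22} = 0.7$ (illustrative values), the function reproduces the closed-form result $\pi_1 = (1 - p_{22})/(2 - p_{11} - p_{22}) = 0.75$.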




Periodic Markov Chains 


If a Markov chain is irreducible, then there is one and only one eigenvalue
equal to unity. However, there may be more than one eigenvalue on the unit circle,
meaning that not all irreducible Markov chains are ergodic. For example, consider
a two-state Markov chain in which $p_{11} = p_{22} = 0$:

$$
\mathbf{P} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}.
$$

The eigenvalues of this transition matrix are $\lambda_1 = 1$ and $\lambda_2 = -1$, both of which
are on the unit circle. Thus, the matrix $\mathbf{P}^m$ does not converge to any fixed limit of
the form $\boldsymbol{\pi} \cdot \mathbf{1}'$ for this case. Instead, if the process is in state 1 at date t, then
it is certain to be there again for dates $t + 2, t + 4, t + 6, \ldots$, with no
tendency to converge as $m \to \infty$. Such a Markov chain is said to be periodic with
period 2.
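The failure of $\mathbf{P}^m$ to converge in this example is easy to see numerically. The following short sketch assumes NumPy; the variable names are chosen only for this illustration.

```python
import numpy as np

# Two-state chain with p11 = p22 = 0: the process switches state every period.
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])

eigvals = np.sort(np.real(np.linalg.eigvals(P)))   # both lie on the unit circle
P_even = np.linalg.matrix_power(P, 10)             # identity matrix at even horizons
P_odd = np.linalg.matrix_power(P, 11)              # equals P itself at odd horizons
```

Even powers of $\mathbf{P}$ equal the identity matrix and odd powers equal $\mathbf{P}$, so the sequence oscillates indefinitely rather than approaching $\boldsymbol{\pi} \cdot \mathbf{1}'$.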


In general, it is possible to show that for any irreducible N-state Markov
chain, all the eigenvalues of the transition matrix will be on or inside the unit circle.
If there are K eigenvalues strictly on the unit circle with K > 1, then the chain is
said to be periodic with period K. Such chains have the property that the states
can be classified into K distinct classes, such that if the state at date t is from class
$\alpha$, then the state at date t + 1 is certain to be from class $\alpha + 1$ (where class $\alpha + 1$
for $\alpha = K$ is interpreted to be class 1). Thus, there is a zero probability of returning
to the original state $s_t$, and indeed zero probability of returning to any member of
the original class $\alpha$, except at horizons that are integer multiples of the period (such
as dates $t + K$, $t + 2K$, $t + 3K$, and so on). For further discussion of periodic
Markov chains, see Cox and Miller (1965).


22.3. Statistical Analysis of i.i.d. Mixture Distributions 


In Section 22.4, we will consider autoregressive processes in which the parameters 
of the autoregression can change as the result of a regime-shift variable. The regime 
itself will be described as the outcome of an unobserved Markov chain. Before 
analyzing such processes, it is instructive first to consider a special case of these 
processes known as i.i.d. mixture distributions. 

Let the regime that a given process is in at date t be indexed by an unobserved
random variable $s_t$, where there are N possible regimes ($s_t = 1, 2, \ldots,$ or $N$).
When the process is in regime 1, the observed variable $y_t$ is presumed to have been
drawn from a $N(\mu_1, \sigma_1^2)$ distribution. If the process is in regime 2, then $y_t$ is drawn
from a $N(\mu_2, \sigma_2^2)$ distribution, and so on. Hence, the density of $y_t$ conditional on
the random variable $s_t$ taking on the value j is

$$
f(y_t \mid s_t = j; \boldsymbol{\theta}) = \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left\{ \frac{-(y_t - \mu_j)^2}{2\sigma_j^2} \right\}
\tag{22.3.1}
$$

for $j = 1, 2, \ldots, N$. Here $\boldsymbol{\theta}$ is a vector of population parameters that includes
$\mu_1, \ldots, \mu_N$ and $\sigma_1^2, \ldots, \sigma_N^2$.

The unobserved regime $\{s_t\}$ is presumed to have been generated by some
probability distribution, for which the unconditional probability that $s_t$ takes on
the value j is denoted $\pi_j$:

$$
P\{s_t = j; \boldsymbol{\theta}\} = \pi_j \qquad \text{for } j = 1, 2, \ldots, N.
\tag{22.3.2}
$$




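The probabilities $\pi_1, \ldots, \pi_N$ are also included in $\boldsymbol{\theta}$; that is, $\boldsymbol{\theta}$ is given by

$$
\boldsymbol{\theta} \equiv (\mu_1, \ldots, \mu_N, \sigma_1^2, \ldots, \sigma_N^2, \pi_1, \ldots, \pi_N)'.
$$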


Recall that for any events A and B, the conditional probability of A given B
is defined as

$$
P\{A \mid B\} = \frac{P\{A \text{ and } B\}}{P\{B\}},
$$

assuming that the probability that event B occurs is not zero. This expression implies
that the joint probability of A and B occurring together can be calculated as

$$
P\{A \text{ and } B\} = P\{A \mid B\} \cdot P\{B\}.
$$


For example, if we were interested in the probability of the joint event that $s_t = j$
and that $y_t$ falls within some interval [c, d], this could be found by integrating

$$
p(y_t, s_t = j; \boldsymbol{\theta}) = f(y_t \mid s_t = j; \boldsymbol{\theta}) \cdot P\{s_t = j; \boldsymbol{\theta}\}
\tag{22.3.3}
$$

over all values of $y_t$ between c and d. Expression [22.3.3] will be called the joint
density-distribution function of $y_t$ and $s_t$. From [22.3.1] and [22.3.2], this
function is given by

$$
p(y_t, s_t = j; \boldsymbol{\theta}) = \frac{\pi_j}{\sqrt{2\pi}\,\sigma_j} \exp\left\{ \frac{-(y_t - \mu_j)^2}{2\sigma_j^2} \right\}.
\tag{22.3.4}
$$


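The unconditional density of $y_t$ can be found by summing [22.3.4] over all
possible values for j:

$$
\begin{aligned}
f(y_t; \boldsymbol{\theta}) &= \sum_{j=1}^{N} p(y_t, s_t = j; \boldsymbol{\theta}) \\
&= \frac{\pi_1}{\sqrt{2\pi}\,\sigma_1} \exp\left\{ \frac{-(y_t - \mu_1)^2}{2\sigma_1^2} \right\}
 + \frac{\pi_2}{\sqrt{2\pi}\,\sigma_2} \exp\left\{ \frac{-(y_t - \mu_2)^2}{2\sigma_2^2} \right\} \\
&\quad + \cdots + \frac{\pi_N}{\sqrt{2\pi}\,\sigma_N} \exp\left\{ \frac{-(y_t - \mu_N)^2}{2\sigma_N^2} \right\}.
\end{aligned}
\tag{22.3.5}
$$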
Since the regime $s_t$ is unobserved, expression [22.3.5] is the relevant density
describing the actually observed data $y_t$. If the regime variable $s_t$ is distributed i.i.d.
across different dates t, then the log likelihood for the observed data can be
calculated from [22.3.5] as

$$
\mathcal{L}(\boldsymbol{\theta}) = \sum_{t=1}^{T} \log f(y_t; \boldsymbol{\theta}).
\tag{22.3.6}
$$

The maximum likelihood estimate of $\boldsymbol{\theta}$ is obtained by maximizing [22.3.6] subject
to the constraints that $\pi_1 + \pi_2 + \cdots + \pi_N = 1$ and $\pi_j \ge 0$ for $j = 1, 2, \ldots, N$.
This can be achieved using the numerical methods described in Section 5.7, or
using the EM algorithm developed later in this section.

Functions of the form of [22.3.5] can be used to represent a broad class of
different densities. Figure 22.2 gives an example for N = 2. The joint
density-distribution $p(y_t, s_t = 1; \boldsymbol{\theta})$ is $\pi_1$ times a $N(\mu_1, \sigma_1^2)$ density, while $p(y_t, s_t = 2; \boldsymbol{\theta})$
is $\pi_2$ times a $N(\mu_2, \sigma_2^2)$ density. The unconditional density for the observed variable
$f(y_t; \boldsymbol{\theta})$ is the sum of these two magnitudes.
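The mixture density [22.3.5] and the log likelihood [22.3.6] can be sketched in a few lines of Python. The code below assumes NumPy; the function names `mixture_density` and `log_likelihood` and their argument layout (arrays of means, variances, and regime probabilities) are illustrative choices rather than anything specified in the text.

```python
import numpy as np


def mixture_density(y, mu, sigma2, pi):
    """Unconditional density f(y_t; theta) of [22.3.5] for each element of y.

    mu, sigma2 and pi are length-N sequences holding the regime means,
    variances and unconditional regime probabilities.
    """
    y = np.atleast_1d(np.asarray(y, dtype=float))[:, None]      # shape (T, 1)
    mu, sigma2, pi = (np.asarray(a, dtype=float) for a in (mu, sigma2, pi))
    # joint[t, j] is the joint density-distribution p(y_t, s_t = j; theta), [22.3.4].
    joint = pi / np.sqrt(2.0 * np.pi * sigma2) * np.exp(-(y - mu) ** 2 / (2.0 * sigma2))
    return joint.sum(axis=1)


def log_likelihood(y, mu, sigma2, pi):
    """Log likelihood [22.3.6] under i.i.d. regimes."""
    return float(np.log(mixture_density(y, mu, sigma2, pi)).sum())
```

Evaluating `mixture_density` on a fine grid with the parameters of Figure 22.2 (means 0 and 4, unit variances, probabilities 0.8 and 0.2) traces out a bimodal curve that integrates to unity.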




FIGURE 22.2 Density of mixture of two Gaussian distributions with $y_t \mid s_t = 1 \sim N(0, 1)$,
$y_t \mid s_t = 2 \sim N(4, 1)$, and $P\{s_t = 1\} = 0.8$.


A mixture of two Gaussian variables need not have the bimodal appearance
of Figure 22.2. Gaussian mixtures can also produce a unimodal density, allowing
skew or kurtosis different from that of a single Gaussian variable, as in Figure
22.3.

FIGURE 22.3 Density of mixture of two Gaussian distributions with $y_t \mid s_t = 1 \sim N(0, 1)$,
$y_t \mid s_t = 2 \sim N(2, 8)$, and $P\{s_t = 1\} = 0.6$.


Inference About the Unobserved Regime


Once one has obtained estimates of $\boldsymbol{\theta}$, it is possible to make an inference
about which regime was more likely to have been responsible for producing the
date t observation of $y_t$. Again, from the definition of a conditional probability, it
follows that

$$
P\{s_t = j \mid y_t; \boldsymbol{\theta}\} = \frac{p(y_t, s_t = j; \boldsymbol{\theta})}{f(y_t; \boldsymbol{\theta})} = \frac{\pi_j \cdot f(y_t \mid s_t = j; \boldsymbol{\theta})}{f(y_t; \boldsymbol{\theta})}.
\tag{22.3.7}
$$

Given knowledge of the population parameters $\boldsymbol{\theta}$, it would be possible to use
[22.3.1] and [22.3.5] to calculate the magnitude in [22.3.7] for each observation $y_t$
in the sample. This number represents the probability, given the observed data,
that the unobserved regime responsible for observation t was regime j. For example,
for the mixture represented in Figure 22.2, if an observation $y_t$ were equal to zero,
one could be virtually certain that the observation had come from a N(0, 1)
distribution rather than a N(4, 1) distribution, so that $P\{s_t = 1 \mid y_t; \boldsymbol{\theta}\}$ for that date
would be near unity. If instead $y_t$ were around 2.3, it is equally likely that the
observation might have come from either regime, so that $P\{s_t = 1 \mid y_t; \boldsymbol{\theta}\}$ for such
an observation would be close to 0.5.
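The inference [22.3.7] amounts to normalizing the joint density-distribution across regimes. The sketch below assumes NumPy; the function name `regime_posterior` is hypothetical, and the parameter values used to exercise it are those of Figure 22.2.

```python
import numpy as np


def regime_posterior(y, mu, sigma2, pi):
    """P{s_t = j | y_t; theta} from [22.3.7]; one row per observation."""
    y = np.atleast_1d(np.asarray(y, dtype=float))[:, None]
    mu, sigma2, pi = (np.asarray(a, dtype=float) for a in (mu, sigma2, pi))
    # Numerator of [22.3.7]: pi_j times the regime-j Gaussian density.
    joint = pi / np.sqrt(2.0 * np.pi * sigma2) * np.exp(-(y - mu) ** 2 / (2.0 * sigma2))
    # Denominator of [22.3.7]: the unconditional density, the sum over regimes.
    return joint / joint.sum(axis=1, keepdims=True)
```

For the Figure 22.2 mixture, the two regimes are exactly equally likely where $0.8\,\phi(y) = 0.2\,\phi(y - 4)$, with $\phi$ the standard normal density, that is, at $y = 2 + \tfrac{1}{4}\log 4 \approx 2.35$, which is consistent with the value "around 2.3" mentioned above.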


Maximum Likelihood Estimates and the EM Algorithm 


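It is instructive to characterize analytically the maximum likelihood estimates
of the population parameter $\boldsymbol{\theta}$. Appendix 22.A demonstrates that the maximum
likelihood estimate $\hat{\boldsymbol{\theta}}$ represents a solution to the following system of nonlinear
equations:

$$
\hat{\mu}_j = \frac{\sum_{t=1}^{T} y_t \cdot P\{s_t = j \mid y_t; \hat{\boldsymbol{\theta}}\}}{\sum_{t=1}^{T} P\{s_t = j \mid y_t; \hat{\boldsymbol{\theta}}\}} \qquad \text{for } j = 1, 2, \ldots, N
\tag{22.3.8}
$$

$$
\hat{\sigma}_j^2 = \frac{\sum_{t=1}^{T} (y_t - \hat{\mu}_j)^2 \cdot P\{s_t = j \mid y_t; \hat{\boldsymbol{\theta}}\}}{\sum_{t=1}^{T} P\{s_t = j \mid y_t; \hat{\boldsymbol{\theta}}\}} \qquad \text{for } j = 1, 2, \ldots, N
\tag{22.3.9}
$$

$$
\hat{\pi}_j = T^{-1} \sum_{t=1}^{T} P\{s_t = j \mid y_t; \hat{\boldsymbol{\theta}}\} \qquad \text{for } j = 1, 2, \ldots, N.
\tag{22.3.10}
$$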


Suppose we were virtually certain which observations came from regime j
and which did not, so that $P\{s_t = j \mid y_t; \boldsymbol{\theta}\}$ equaled unity for those observations that
came from regime j and equaled zero for those observations that came from other
regimes. Then the estimate of the mean for regime j in [22.3.8] would simply be
the average value of $y_t$ for those observations known to have come from regime j.
In the more general case where $P\{s_t = j \mid y_t; \boldsymbol{\theta}\}$ is between 0 and 1 for some
observations, the estimate $\hat{\mu}_j$ is a weighted average of all the observations in the
sample, where the weight for observation $y_t$ is proportional to the probability that
date t's observation was generated by regime j. The more likely an observation is
to have come from regime j, the bigger the weight given that observation in
estimating $\mu_j$. Similarly, $\hat{\sigma}_j^2$ is a weighted average of the squared deviations of $y_t$ from
$\hat{\mu}_j$, while $\hat{\pi}_j$ is essentially the fraction of observations that appear to have come
from regime j.

Because equations [22.3.8] to [22.3.10] are nonlinear, it is not possible to
solve them analytically for $\hat{\boldsymbol{\theta}}$ as a function of $\{y_1, y_2, \ldots, y_T\}$. However, these
equations do suggest an appealing iterative algorithm for finding the maximum
likelihood estimate. Starting from an arbitrary initial guess for the value of $\boldsymbol{\theta}$,
denoted $\boldsymbol{\theta}^{(0)}$, one could calculate $P\{s_t = j \mid y_t; \boldsymbol{\theta}^{(0)}\}$ from [22.3.7]. One could then
calculate the magnitudes on the right sides of [22.3.8] through [22.3.10] with $\boldsymbol{\theta}^{(0)}$
in place of $\hat{\boldsymbol{\theta}}$. The left sides of [22.3.8] through [22.3.10] would then produce a
new estimate $\boldsymbol{\theta}^{(1)}$. This estimate $\boldsymbol{\theta}^{(1)}$ could be used to reevaluate $P\{s_t = j \mid y_t; \boldsymbol{\theta}^{(1)}\}$
and recalculate the expressions on the right sides of [22.3.8] through [22.3.10]. The
left sides of [22.3.8] through [22.3.10] then can produce a new estimate $\boldsymbol{\theta}^{(2)}$. One
continues iterating in this fashion until the change between $\boldsymbol{\theta}^{(m+1)}$ and $\boldsymbol{\theta}^{(m)}$ is smaller
than some specified convergence criterion.

This algorithm turns out to be a special case of the EM principle developed
by Dempster, Laird, and Rubin (1977). One can show that each iteration on this
algorithm increases the value of the likelihood function. Clearly, if the iterations
reach a point such that $\boldsymbol{\theta}^{(m)} = \boldsymbol{\theta}^{(m+1)}$, the algorithm has found the maximum
likelihood estimate $\hat{\boldsymbol{\theta}}$.
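The iteration on [22.3.7] through [22.3.10] can be sketched as follows. This is a minimal illustration assuming NumPy, not a production routine: the function names `e_step` and `em_mixture`, the defaults `max_iter` and `tol`, and the convergence rule based on the largest absolute parameter change are all assumptions. One further choice should be flagged: the variance update below uses the newly updated mean, as in the standard EM M-step, whereas a literal reading of the description above would plug the previous iteration's mean into the right side of [22.3.9].

```python
import numpy as np


def e_step(y, mu, sigma2, pi):
    """Regime probabilities [22.3.7] and log likelihood [22.3.6] at given parameters."""
    joint = pi / np.sqrt(2.0 * np.pi * sigma2) * np.exp(
        -(y[:, None] - mu) ** 2 / (2.0 * sigma2))
    lik = joint.sum(axis=1)
    return joint / lik[:, None], float(np.log(lik).sum())


def em_mixture(y, mu, sigma2, pi, max_iter=500, tol=1e-8):
    """Iterate on [22.3.8]-[22.3.10] from the starting values (mu, sigma2, pi)."""
    y = np.asarray(y, dtype=float)
    mu, sigma2, pi = (np.asarray(a, dtype=float).copy() for a in (mu, sigma2, pi))
    loglik_path = []
    for _ in range(max_iter):
        post, loglik = e_step(y, mu, sigma2, pi)
        loglik_path.append(loglik)              # log likelihood before this update
        weight = post.sum(axis=0)               # denominators of [22.3.8], [22.3.9]
        new_mu = (post * y[:, None]).sum(axis=0) / weight                       # [22.3.8]
        new_sigma2 = (post * (y[:, None] - new_mu) ** 2).sum(axis=0) / weight   # [22.3.9]
        new_pi = weight / y.size                                                # [22.3.10]
        change = max(np.abs(new_mu - mu).max(),
                     np.abs(new_sigma2 - sigma2).max(),
                     np.abs(new_pi - pi).max())
        mu, sigma2, pi = new_mu, new_sigma2, new_pi
        if change < tol:
            break
    return mu, sigma2, pi, loglik_path
```

The returned `loglik_path` records the log likelihood at the start of each iteration, so the property that each iteration does not lower the likelihood can be inspected directly. As discussed below, poor starting values can lead the iterations toward a singularity, in which case one would restart from different values.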


Further Discussion 


The mixture density [22.3.5] has the property that a global maximum of the
log likelihood [22.3.6] does not exist. A singularity arises whenever one of the
distributions is imputed to have a mean exactly equal to one of the observations
($\mu_1 = y_1$, say) with no variance ($\sigma_1^2 \to 0$). At such a point the log likelihood
becomes infinite.

Such singularities do not pose a major problem in practice, since numerical
maximization procedures typically converge to a reasonable local maximum rather
than a singularity. The largest local maximum with $\sigma_j > 0$ for all j is described as
the maximum likelihood estimate. Kiefer (1978) showed that there exists a bounded
local maximum of [22.3.6] that yields a consistent, asymptotically Gaussian estimate
of $\boldsymbol{\theta}$ for which standard errors can be constructed using the usual formulas such as
expression [5.8.3]. Hence, if a numerical maximization algorithm becomes stuck
at a singularity, one satisfactory solution is simply to ignore the singularity and try
again with different starting values.

Another approach is to maximize a slightly different objective function such
as

$$
Q(\boldsymbol{\theta}) = \mathcal{L}(\boldsymbol{\theta}) - \sum_{j=1}^{N} (a_j/2) \log(\sigma_j^2) - \sum_{j=1}^{N} b_j/(2\sigma_j^2) - \sum_{j=1}^{N} c_j (m_j - \mu_j)^2/(2\sigma_j^2),
\tag{22.3.11}
$$

where $\mathcal{L}(\boldsymbol{\theta})$ is the log likelihood function described in [22.3.6]. If $a_j = c_j$, then
expression [22.3.11] is the form the log likelihood would take if, in addition to the
data, the analyst had $a_j$ observations from regime j whose sample mean was $m_j$ and
whose sample variance was $b_j/a_j$. Thus, $m_j$ represents the analyst's prior expectation
of the value of $\mu_j$, and $b_j/a_j$ represents the analyst's prior expectation of the value
of $\sigma_j^2$. The parameters $a_j$ and $c_j$ represent the strength of these priors, measured
in terms of the confidence one would have if the priors were based on $a_j$ or $c_j$ direct
observations of data known to have come from regime j. See Hamilton (1991) for
further discussion of this approach.

Nice surveys of i.i.d. mixture distributions have been provided by Everitt and 
Hand (1981) and Titterington, Smith, and Makov (1985). 




22.4. Time Series Models of Changes in Regime 


Description of the Process 


We now return to the objective of developing a model that allows a given
variable to follow a different time series process over different subsamples. As an
illustration, consider a first-order autoregression in which both the constant term
and the autoregressive coefficient might be different for different subsamples:

$$
y_t = c_{s_t} + \phi_{s_t} y_{t-1} + \varepsilon_t,
\tag{22.4.1}
$$

where $\varepsilon_t \sim$ i.i.d. $N(0, \sigma^2)$. The proposal will be to model the regime $s_t$ as the
outcome of an unobserved N-state Markov chain with $s_t$ independent of $\varepsilon_\tau$ for all
t and $\tau$.

Why might a Markov chain be a useful description of the process generating 
changes in regime? One’s first thought could be that a change in regime such as 
that in Figure 22.1 is a permanent event. Such a permanent regime change could 
be modeled with a two-state Markov chain in which state 2 is an absorbing state. 
The advantage of using a Markov chain over a deterministic specification for such 
a process is that it allows one to generate meaningful forecasts prior to the change 
that take into account the possibility of the change from regime 1 to regime 2. 

We might also want a time series model of changes in regime to account for 
unusual short-lived events such as World War II. Again, it is possible to choose 
parameters for a Markov chain such that, given 100 years of data, it is quite likely 
that we would have observed a single episode of regime 2 lasting for about 5 years. 
A Markov chain specification, of course, implies that given another 100 years we 
could well see another such event. One might argue that this is a sensible property 
to build into a model. The essence of the scientific method is the presumption that 
the future will in some sense be like the past. 

While the Markov chain can describe such examples of changes in regime, a 
further advantage is its flexibility. There seems some value in specifying a prob- 
ability law consistent with a broad range of different outcomes, and choosing 
particular parameters within that class on the basis of the data alone. 

In any case, the approach described here readily generalizes to processes in
which the probability that $s_t = j$ depends not only on the value of $s_{t-1}$ but also on
a vector of other observed variables; see Filardo (1992) and Diebold, Lee, and
Weinbach (forthcoming).

The general model investigated in this section is as follows. Let $\mathbf{y}_t$ be an
$(n \times 1)$ vector of observed endogenous variables and $\mathbf{x}_t$ a $(k \times 1)$ vector of observed
exogenous variables. Let $\mathcal{Y}_t = (\mathbf{y}_t', \mathbf{y}_{t-1}', \ldots, \mathbf{y}_{-m}', \mathbf{x}_t', \mathbf{x}_{t-1}', \ldots, \mathbf{x}_{-m}')'$ be a
vector containing all observations obtained through date t. If the process is governed
by regime $s_t = j$ at date t, then the conditional density of $\mathbf{y}_t$ is assumed to be given
by

$$
f(\mathbf{y}_t \mid s_t = j, \mathbf{x}_t, \mathcal{Y}_{t-1}; \boldsymbol{\alpha}),
\tag{22.4.2}
$$

where $\boldsymbol{\alpha}$ is a vector of parameters characterizing the conditional density. If there
are N different regimes, then there are N different densities represented by [22.4.2]
for $j = 1, 2, \ldots, N$. These densities will be collected in an $(N \times 1)$ vector denoted
$\boldsymbol{\eta}_t$.

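For the example of [22.4.1], $y_t$ is a scalar (n = 1), the exogenous variables
consist only of a constant term ($x_t = 1$), and the unknown parameters in $\boldsymbol{\alpha}$ consist
of $c_1, \ldots, c_N$, $\phi_1, \ldots, \phi_N$, and $\sigma^2$. With N = 2 regimes the two densities
represented by [22.4.2] are

$$
\boldsymbol{\eta}_t = \begin{bmatrix} f(y_t \mid s_t = 1, y_{t-1}; \boldsymbol{\alpha}) \\[1ex] f(y_t \mid s_t = 2, y_{t-1}; \boldsymbol{\alpha}) \end{bmatrix}
= \begin{bmatrix}
\dfrac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ \dfrac{-(y_t - c_1 - \phi_1 y_{t-1})^2}{2\sigma^2} \right\} \\[3ex]
\dfrac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ \dfrac{-(y_t - c_2 - \phi_2 y_{t-1})^2}{2\sigma^2} \right\}
\end{bmatrix}.
$$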


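It is assumed in [22.4.2] that the conditional density depends only on the
current regime $s_t$ and not on past regimes:

$$
f(\mathbf{y}_t \mid \mathbf{x}_t, \mathcal{Y}_{t-1}, s_t = j; \boldsymbol{\alpha}) = f(\mathbf{y}_t \mid \mathbf{x}_t, \mathcal{Y}_{t-1}, s_t = j, s_{t-1} = i, s_{t-2} = k, \ldots; \boldsymbol{\alpha}),
\tag{22.4.3}
$$

though this is not really restrictive. Consider, for example, the specification in
[22.1.3], where the conditional density of $y_t$ depends on both $s_t^*$ and $s_{t-1}^*$ and where
$s_t^*$ is described by a two-state Markov chain. One can define a new variable $s_t$ that
characterizes the regime for date t in a way consistent with [22.4.2] as follows:

$$
\begin{aligned}
s_t &= 1 \quad \text{if } s_t^* = 1 \text{ and } s_{t-1}^* = 1 \\
s_t &= 2 \quad \text{if } s_t^* = 2 \text{ and } s_{t-1}^* = 1 \\
s_t &= 3 \quad \text{if } s_t^* = 1 \text{ and } s_{t-1}^* = 2 \\
s_t &= 4 \quad \text{if } s_t^* = 2 \text{ and } s_{t-1}^* = 2.
\end{aligned}
$$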

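If $p_{ij}^*$ denotes $P\{s_t^* = j \mid s_{t-1}^* = i\}$, then $s_t$ follows a four-state Markov chain with
transition matrix

$$
\mathbf{P} = \begin{bmatrix}
p_{11}^* & 0 & p_{11}^* & 0 \\
p_{12}^* & 0 & p_{12}^* & 0 \\
0 & p_{21}^* & 0 & p_{21}^* \\
0 & p_{22}^* & 0 & p_{22}^*
\end{bmatrix}.
$$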
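The construction of the four-state transition matrix can be mechanized by enumerating the pairs $(s_t^*, s_{t-1}^*)$. The following sketch assumes NumPy; the function name `expand_two_state` and the numerical values in the usage note are hypothetical.

```python
import numpy as np


def expand_two_state(P_star):
    """Build the 4-state transition matrix for s_t from the 2-state matrix for s*_t.

    P_star[j-1, i-1] = P{s*_t = j | s*_{t-1} = i}, so columns sum to unity.
    """
    # (s*_t, s*_{t-1}) pairs corresponding to s_t = 1, 2, 3, 4.
    states = [(1, 1), (2, 1), (1, 2), (2, 2)]
    P = np.zeros((4, 4))
    for i, (cur, _) in enumerate(states):             # state occupied at date t
        for j, (nxt, prev) in enumerate(states):      # candidate state at date t + 1
            # The "previous" component at t + 1 must equal the current s*_t.
            if prev == cur:
                P[j, i] = P_star[nxt - 1, cur - 1]
    return P
```

With $p_{11}^* = 0.9$ and $p_{22}^* = 0.7$, the first and third columns of the result are $(0.9, 0.1, 0, 0)'$ and the second and fourth are $(0, 0, 0.3, 0.7)'$, matching the pattern of zeros in the matrix above.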

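Hence, [22.1.3] could be represented as a special case of this framework with
N = 4, $\boldsymbol{\alpha} = (\mu_1, \mu_2, \phi, \sigma^2)'$ and with [22.4.2] representing the four densities

$$
\begin{aligned}
f(y_t \mid y_{t-1}, s_t = 1; \boldsymbol{\alpha}) &= \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ \frac{-[(y_t - \mu_1) - \phi(y_{t-1} - \mu_1)]^2}{2\sigma^2} \right\} \\
f(y_t \mid y_{t-1}, s_t = 2; \boldsymbol{\alpha}) &= \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ \frac{-[(y_t - \mu_2) - \phi(y_{t-1} - \mu_1)]^2}{2\sigma^2} \right\} \\
f(y_t \mid y_{t-1}, s_t = 3; \boldsymbol{\alpha}) &= \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ \frac{-[(y_t - \mu_1) - \phi(y_{t-1} - \mu_2)]^2}{2\sigma^2} \right\} \\
f(y_t \mid y_{t-1}, s_t = 4; \boldsymbol{\alpha}) &= \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ \frac{-[(y_t - \mu_2) - \phi(y_{t-1} - \mu_2)]^2}{2\sigma^2} \right\}.
\end{aligned}
$$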


It is assumed that $s_t$ evolves according to a Markov chain that is independent
of past observations on $\mathbf{y}_t$ or current or past $\mathbf{x}_t$:

$$
P\{s_t = j \mid s_{t-1} = i, s_{t-2} = k, \ldots, \mathbf{x}_t, \mathcal{Y}_{t-1}\} = P\{s_t = j \mid s_{t-1} = i\} = p_{ij}.
\tag{22.4.4}
$$

For generalizations of this assumption, see Lam (1990), Durland and McCurdy
(1992), Filardo (1992), and Diebold, Lee, and Weinbach (forthcoming).




Optimal Inference About Regimes and Evaluation 
of the Likelihood Function 


The population parameters that describe a time series governed by [22.4.2]
and [22.4.4] consist of $\boldsymbol{\alpha}$ and the various transition probabilities $p_{ij}$. Collect these
parameters in a vector $\boldsymbol{\theta}$. One important objective will be to estimate the value of
$\boldsymbol{\theta}$ based on observation of $\mathcal{Y}_T$. Let us nevertheless put this objective on hold for
the moment and suppose that the value of $\boldsymbol{\theta}$ is somehow known with certainty to
the analyst. Even if we know the value of $\boldsymbol{\theta}$, we will not know which regime the
process was in at every date in the sample. Instead the best we can do is to form
a probabilistic inference that is a generalization of [22.3.7]. In the i.i.d. case, the
analyst's inference about the value of $s_t$ depends only on the value of $y_t$. In the
more general class of time series models described here the inference typically
depends on all the observations available.

Let $P\{s_t = j \mid \mathcal{Y}_t; \boldsymbol{\theta}\}$ denote the analyst's inference about the value of $s_t$ based
on data obtained through date t and based on knowledge of the population
parameters $\boldsymbol{\theta}$. This inference takes the form of a conditional probability that the
analyst assigns to the possibility that the tth observation was generated by regime
j. Collect these conditional probabilities $P\{s_t = j \mid \mathcal{Y}_t; \boldsymbol{\theta}\}$ for $j = 1, 2, \ldots, N$
in an $(N \times 1)$ vector denoted $\hat{\boldsymbol{\xi}}_{t|t}$.

One could also imagine forming forecasts of how likely the process is to be
in regime j in period t + 1 given observations obtained through date t. Collect
these forecasts in an $(N \times 1)$ vector $\hat{\boldsymbol{\xi}}_{t+1|t}$, which is a vector whose jth element
represents $P\{s_{t+1} = j \mid \mathcal{Y}_t; \boldsymbol{\theta}\}$.

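The optimal inference and forecast for each date t in the sample can be found
by iterating on the following pair of equations:

$$
\hat{\boldsymbol{\xi}}_{t|t} = \frac{(\hat{\boldsymbol{\xi}}_{t|t-1} \odot \boldsymbol{\eta}_t)}{\mathbf{1}'(\hat{\boldsymbol{\xi}}_{t|t-1} \odot \boldsymbol{\eta}_t)}
\tag{22.4.5}
$$

$$
\hat{\boldsymbol{\xi}}_{t+1|t} = \mathbf{P} \cdot \hat{\boldsymbol{\xi}}_{t|t}.
\tag{22.4.6}
$$

Here $\boldsymbol{\eta}_t$ represents the $(N \times 1)$ vector whose jth element is the conditional density
in [22.4.2], $\mathbf{P}$ represents the $(N \times N)$ transition matrix defined in [22.2.3], $\mathbf{1}$
represents an $(N \times 1)$ vector of 1s, and the symbol $\odot$ denotes element-by-element
multiplication. Given a starting value $\hat{\boldsymbol{\xi}}_{1|0}$ and an assumed value for the population
parameter vector $\boldsymbol{\theta}$, one can iterate on [22.4.5] and [22.4.6] for $t = 1, 2, \ldots, T$
to calculate the values of $\hat{\boldsymbol{\xi}}_{t|t}$ and $\hat{\boldsymbol{\xi}}_{t+1|t}$ for each date t in the sample. The log
likelihood function $\mathcal{L}(\boldsymbol{\theta})$ for the observed data $\mathcal{Y}_T$ evaluated at the value of $\boldsymbol{\theta}$ that
was used to perform the iterations can also be calculated as a by-product of this
algorithm from

$$
\mathcal{L}(\boldsymbol{\theta}) = \sum_{t=1}^{T} \log f(\mathbf{y}_t \mid \mathbf{x}_t, \mathcal{Y}_{t-1}; \boldsymbol{\theta}),
\tag{22.4.7}
$$

where

$$
f(\mathbf{y}_t \mid \mathbf{x}_t, \mathcal{Y}_{t-1}; \boldsymbol{\theta}) = \mathbf{1}'(\hat{\boldsymbol{\xi}}_{t|t-1} \odot \boldsymbol{\eta}_t).
\tag{22.4.8}
$$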


We now explain why this algorithm works. 
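Before turning to the derivation, it may help to see the recursion written out. The sketch below is an illustrative Python rendering of [22.4.5] through [22.4.8], assuming NumPy. The function names `hamilton_filter` and `ar1_regime_densities`, the array layout (one row per date, one column per regime), and the treatment of the first observation as a presample value for the AR(1) example are assumptions made for this example.

```python
import numpy as np


def ar1_regime_densities(y, c, phi, sigma2):
    """eta_t for the example [22.4.1]; row t-1 holds the N densities of y_t given y_{t-1}.

    The first element of y serves only as the presample value y_0.
    """
    y = np.asarray(y, dtype=float)
    resid = y[1:, None] - np.asarray(c, dtype=float) - np.asarray(phi, dtype=float) * y[:-1, None]
    return np.exp(-resid ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)


def hamilton_filter(eta, P, xi_1_0):
    """Iterate on [22.4.5]-[22.4.6] and accumulate the log likelihood [22.4.7].

    eta    : (T, N) array of conditional densities, one row per date.
    P      : (N, N) transition matrix whose columns sum to unity.
    xi_1_0 : starting value for the regime probabilities at date 1.
    """
    T, N = eta.shape
    xi_filt = np.zeros((T, N))          # row t holds the inference xi_{t|t}
    xi_pred = np.zeros((T, N))          # row t holds the forecast xi_{t+1|t}
    xi = np.asarray(xi_1_0, dtype=float)
    loglik = 0.0
    for t in range(T):
        joint = xi * eta[t]             # element-by-element product
        f_t = joint.sum()               # conditional density of y_t, [22.4.8]
        xi_filt[t] = joint / f_t        # [22.4.5]
        xi = P @ xi_filt[t]             # [22.4.6]
        xi_pred[t] = xi
        loglik += np.log(f_t)           # running sum in [22.4.7]
    return xi_filt, xi_pred, loglik
```

One way to check such a routine is the i.i.d. special case of Section 22.3: if every column of $\mathbf{P}$ equals the same vector of regime probabilities, the forecasts never change and the log likelihood reduces to the mixture log likelihood [22.3.6].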


Derivation of Equations [22.4.5] Through [22.4.8] 


To see the basis for the algorithm just described, note that we have assumed
that $x_t$ is exogenous, by which we mean that $x_t$ contains no information about $s_t$
beyond that contained in $\mathcal{Y}_{t-1}$. Hence, the $j$th element of $\hat{\xi}_{t|t-1}$ could also be
described as $P\{s_t = j \mid x_t, \mathcal{Y}_{t-1}; \theta\}$. The $j$th element of $\eta_t$ is $f(y_t \mid s_t = j, x_t, \mathcal{Y}_{t-1}; \theta)$.
The $j$th element of the $(N \times 1)$ vector $(\hat{\xi}_{t|t-1} \odot \eta_t)$ is the product of these two
magnitudes, which product can be interpreted as the conditional joint density-distribution
of $y_t$ and $s_t$:


$$P\{s_t = j \mid x_t, \mathcal{Y}_{t-1}; \theta\} \times f(y_t \mid s_t = j, x_t, \mathcal{Y}_{t-1}; \theta) = p(y_t, s_t = j \mid x_t, \mathcal{Y}_{t-1}; \theta).$$ [22.4.9]


The density of the observed vector $y_t$ conditioned on past observables is the sum
of the $N$ magnitudes in [22.4.9] for $j = 1, 2, \ldots, N$. This sum can be written in
vector notation as

$$f(y_t \mid x_t, \mathcal{Y}_{t-1}; \theta) = \mathbf{1}'(\hat{\xi}_{t|t-1} \odot \eta_t),$$

as claimed in [22.4.8]. If the joint density-distribution in [22.4.9] is divided by the
density of $y_t$ in [22.4.8], the result is the conditional distribution of $s_t$:


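$$\frac{p(y_t, s_t = j \mid x_t, \mathcal{Y}_{t-1}; \theta)}{f(y_t \mid x_t, \mathcal{Y}_{t-1}; \theta)} = P\{s_t = j \mid y_t, x_t, \mathcal{Y}_{t-1}; \theta\} = P\{s_t = j \mid \mathcal{Y}_t; \theta\}.$$

Hence, from [22.4.8],

$$P\{s_t = j \mid \mathcal{Y}_t; \theta\} = \frac{p(y_t, s_t = j \mid x_t, \mathcal{Y}_{t-1}; \theta)}{\mathbf{1}'(\hat{\xi}_{t|t-1} \odot \eta_t)}.$$ [22.4.10]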


But recall from [22.4.9] that the numerator in the expression on the right side of
[22.4.10] is the $j$th element of the vector $(\hat{\xi}_{t|t-1} \odot \eta_t)$, while the left side of [22.4.10]
is the $j$th element of the vector $\hat{\xi}_{t|t}$. Thus, collecting the equations in [22.4.10] for
$j = 1, 2, \ldots, N$ into an $(N \times 1)$ vector produces

$$\hat{\xi}_{t|t} = \frac{(\hat{\xi}_{t|t-1} \odot \eta_t)}{\mathbf{1}'(\hat{\xi}_{t|t-1} \odot \eta_t)},$$

as claimed in [22.4.5].


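To see the basis for [22.4.6], take expectations of [22.2.6] conditional on
$\mathcal{Y}_t$:

$$E(\xi_{t+1} \mid \mathcal{Y}_t) = P \cdot E(\xi_t \mid \mathcal{Y}_t) + E(v_{t+1} \mid \mathcal{Y}_t).$$ [22.4.11]

Note that $v_{t+1}$ is a martingale difference sequence with respect to $\mathcal{Y}_t$, so that
[22.4.11] becomes

$$\hat{\xi}_{t+1|t} = P \cdot \hat{\xi}_{t|t},$$

as claimed in [22.4.6].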


Starting the Algorithm 


Given a starting value $\hat{\xi}_{1|0}$, one can use [22.4.5] and [22.4.6] to calculate
$\hat{\xi}_{t|t}$ for any $t$. Several options are available for choosing the starting value. One
approach is to set $\hat{\xi}_{1|0}$ equal to the vector of unconditional probabilities $\pi$ described
in equation [22.2.26]. Another option is to set

$$\hat{\xi}_{1|0} = \rho,$$ [22.4.12]

where $\rho$ is a fixed $(N \times 1)$ vector of nonnegative constants summing to unity, such
as $\rho = N^{-1} \cdot \mathbf{1}$. Alternatively, $\rho$ could be estimated by maximum likelihood along
with $\theta$ subject to the constraint that $\mathbf{1}'\rho = 1$ and $\rho_j \geq 0$ for $j = 1, 2, \ldots, N$.
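For the first option, the vector of unconditional probabilities must satisfy $P\pi = \pi$ and $\mathbf{1}'\pi = 1$. One way to compute it numerically is sketched below; the function name `ergodic_probs` and the example matrix are illustrative assumptions, and the least-squares solve is simply one convenient way to impose both conditions at once.

```python
import numpy as np

def ergodic_probs(P):
    """Solve P @ pi = pi together with sum(pi) = 1 by least squares.

    P is an (N, N) transition matrix whose columns sum to one.
    """
    N = P.shape[0]
    # Stack (I - P) pi = 0 on top of 1' pi = 1.
    A = np.vstack([np.eye(N) - P, np.ones((1, N))])
    e = np.zeros(N + 1)
    e[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, e, rcond=None)
    return pi

# Hypothetical two-state chain.
P = np.array([[0.9, 0.25],
              [0.1, 0.75]])
pi = ergodic_probs(P)   # candidate starting value for the filter
```

For a two-state chain the result can be checked against the closed form $\pi_1 = (1 - p_{22})/(2 - p_{11} - p_{22})$.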


Forecasts and Smoothed Inferences for the Regime 


Generalizing the earlier notation, let $\hat{\xi}_{t|\tau}$ represent the $(N \times 1)$ vector whose
$j$th element is $P\{s_t = j \mid \mathcal{Y}_\tau; \theta\}$. For $t > \tau$, this represents a forecast about the regime
for some future period, whereas for $t < \tau$ it represents the smoothed inference
about the regime the process was in at date $t$ based on data obtained through some
later date $\tau$.

The optimal $m$-period-ahead forecast of $\xi_{t+m}$ can be found by taking expectations
of both sides of [22.2.8] conditional on information available at date $t$:

$$E(\xi_{t+m} \mid \mathcal{Y}_t) = P^m \cdot E(\xi_t \mid \mathcal{Y}_t)$$

or

$$\hat{\xi}_{t+m|t} = P^m \cdot \hat{\xi}_{t|t},$$ [22.4.13]

where $\hat{\xi}_{t|t}$ is calculated from [22.4.5].
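As a brief illustration of [22.4.13], the sketch below raises a transition matrix to the $m$th power and applies it to a filtered probability vector. The function name `regime_forecast` and the numbers are hypothetical.

```python
import numpy as np

def regime_forecast(P, xi_filt_t, m):
    """m-period-ahead regime probabilities, P^m times the filtered vector."""
    return np.linalg.matrix_power(P, m) @ xi_filt_t

# Hypothetical values: columns of P sum to one.
P = np.array([[0.9, 0.25],
              [0.1, 0.75]])
xi_now = np.array([0.2, 0.8])
one_step = regime_forecast(P, xi_now, 1)
long_run = regime_forecast(P, xi_now, 200)
```

For an ergodic chain the long-horizon forecast no longer depends on the current inference and settles at the unconditional probabilities.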
Smoothed inferences can be calculated using an algorithm developed by Kim
(1993). In vector form, this algorithm can be written as

$$\hat{\xi}_{t|T} = \hat{\xi}_{t|t} \odot \{P' \cdot [\hat{\xi}_{t+1|T} \,(\div)\, \hat{\xi}_{t+1|t}]\},$$ [22.4.14]

where the sign $(\div)$ denotes element-by-element division. The smoothed probabilities
$\hat{\xi}_{t|T}$ are found by iterating on [22.4.14] backward for $t = T - 1, T - 2,
\ldots, 1$. This iteration is started with $\hat{\xi}_{T|T}$, which is obtained from [22.4.5] for
$t = T$. This algorithm is valid only when $s_t$ follows a first-order Markov chain as
in [22.4.4], when the conditional density [22.4.2] depends on $s_t, s_{t-1}, \ldots$ only
through the current state $s_t$, and when $x_t$, the vector of explanatory variables other
than the lagged values of $y$, is strictly exogenous, meaning that $x_t$ is independent
of $s_\tau$ for all $t$ and $\tau$. The basis for Kim's algorithm is explained in Appendix 22.A
at the end of the chapter.
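The backward recursion [22.4.14] can be sketched as follows. This is an illustrative sketch only: the function name `kim_smoother` and the array layout are assumptions, and the filtered probabilities in the example are made-up numbers rather than output from real data.

```python
import numpy as np

def kim_smoother(xi_filt, xi_pred, P):
    """Backward recursion [22.4.14].

    xi_filt : (T, N) filtered probabilities, row t conditional on data through t
    xi_pred : (T, N) forecasts, row t equal to P @ xi_filt[t]
    P       : (N, N) transition matrix whose columns sum to one
    """
    T, N = xi_filt.shape
    xi_smooth = np.zeros((T, N))
    xi_smooth[-1] = xi_filt[-1]              # start from the last filtered value
    for t in range(T - 2, -1, -1):
        ratio = xi_smooth[t + 1] / xi_pred[t]       # element-by-element division
        xi_smooth[t] = xi_filt[t] * (P.T @ ratio)   # element-by-element product
    return xi_smooth

# Hypothetical filtered probabilities for three dates.
P = np.array([[0.9, 0.25],
              [0.1, 0.75]])
xi_filt = np.array([[0.8, 0.2],
                    [0.3, 0.7],
                    [0.6, 0.4]])
xi_pred = xi_filt @ P.T        # row t is P @ xi_filt[t]
xi_smooth = kim_smoother(xi_filt, xi_pred, P)
```

Each smoothed vector sums to unity provided the forecasts passed in are the ones implied by [22.4.6].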


Forecasts for the Observed Variables 


From the conditional density [22.4.2] it is straightforward to forecast $y_{t+1}$
conditional on knowing $\mathcal{Y}_t$, $x_{t+1}$, and $s_{t+1}$. For example, for the AR(1) specification
$y_{t+1} = c_{s_{t+1}} + \phi_{s_{t+1}} y_t + \varepsilon_{t+1}$, such a forecast is given by

$$E(y_{t+1} \mid s_{t+1} = j, \mathcal{Y}_t; \theta) = c_j + \phi_j y_t.$$ [22.4.15]


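There are $N$ different conditional forecasts associated with the $N$ possible values
for $s_{t+1}$. Note that the unconditional forecast based on actual observable variables
is related to these conditional forecasts by

$$\begin{aligned}
E(y_{t+1} \mid x_{t+1}, \mathcal{Y}_t; \theta)
&= \int y_{t+1} \cdot f(y_{t+1} \mid x_{t+1}, \mathcal{Y}_t; \theta)\, dy_{t+1} \\
&= \int y_{t+1} \Big\{ \sum_{j=1}^{N} p(y_{t+1}, s_{t+1} = j \mid x_{t+1}, \mathcal{Y}_t; \theta) \Big\}\, dy_{t+1} \\
&= \int y_{t+1} \Big\{ \sum_{j=1}^{N} \big[ f(y_{t+1} \mid s_{t+1} = j, x_{t+1}, \mathcal{Y}_t; \theta)\, P\{s_{t+1} = j \mid x_{t+1}, \mathcal{Y}_t; \theta\} \big] \Big\}\, dy_{t+1} \\
&= \sum_{j=1}^{N} P\{s_{t+1} = j \mid x_{t+1}, \mathcal{Y}_t; \theta\} \int y_{t+1} \cdot f(y_{t+1} \mid s_{t+1} = j, x_{t+1}, \mathcal{Y}_t; \theta)\, dy_{t+1} \\
&= \sum_{j=1}^{N} P\{s_{t+1} = j \mid \mathcal{Y}_t; \theta\}\, E(y_{t+1} \mid s_{t+1} = j, x_{t+1}, \mathcal{Y}_t; \theta).
\end{aligned}$$

Thus, the forecast appropriate for the $j$th regime is simply multiplied by the probability
that the process will be in the $j$th regime, and the resulting $N$ different
products are added together. For example, if the $j = 1, 2, \ldots, N$ forecasts
in [22.4.15] are collected in a $(1 \times N)$ vector $h_t'$, then

$$E(y_{t+1} \mid \mathcal{Y}_t; \theta) = h_t' \hat{\xi}_{t+1|t}.$$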
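A small numerical sketch of this probability-weighted forecast for the AR(1) example follows. All numbers (the intercepts, autoregressive coefficients, current observation, and regime forecast) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical regime-specific AR(1) parameters for N = 2 regimes.
c = np.array([1.0, -0.5])      # intercepts c_j
phi = np.array([0.3, 0.6])     # autoregressive coefficients phi_j
y_t = 2.0                      # most recent observation

# Conditional forecasts c_j + phi_j * y_t, one per regime, as in [22.4.15].
h = c + phi * y_t

# Hypothetical one-step-ahead regime probabilities from the filter.
xi_pred = np.array([0.8, 0.2])

# Probability-weighted forecast: inner product of h with the regime forecast.
forecast = h @ xi_pred
```

Here the regime-specific forecasts are 1.6 and 0.7, so the weighted forecast is 0.8(1.6) + 0.2(0.7) = 1.42.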


Note that although the Markov chain itself admits the linear representation
[22.2.6], the optimal forecast of $y_{t+1}$ is a nonlinear function of observables, since
the inference $\hat{\xi}_{t|t}$ in [22.4.5] depends nonlinearly on $\mathcal{Y}_t$. Although one may use
a linear model to form forecasts within a given regime, if an observation seems
unlikely to have been generated by the same regime as preceding observations,
the appearance of the outlier causes the analyst to switch to a new rule for forming
future linear forecasts.

The Markov chain is clearly well suited for forming multiperiod forecasts as
well. See Hamilton (1989, 1993b, 1993c) for further discussion.


Maximum Likelihood Estimation of Parameters 


In the iteration on [22.4.5] and [22.4.6], the parameter vector $\theta$ was taken
to be a fixed, known vector. Once the iteration has been completed for $t = 1, 2,
\ldots, T$ for a given fixed $\theta$, the value of the log likelihood implied by that value
of $\theta$ is then known from [22.4.7]. The value of $\theta$ that maximizes the log likelihood
can be found numerically using the methods described in Section 5.7.

If the transition probabilities are restricted only by the conditions that $p_{ij} \geq 0$
and $(p_{i1} + p_{i2} + \cdots + p_{iN}) = 1$ for all $i$ and $j$, and if the initial probability
$\hat{\xi}_{1|0}$ is taken to be a fixed value $\rho$ unrelated to the other parameters, then it is
shown in Hamilton (1990) that the maximum likelihood estimates for the transition
probabilities satisfy

$$\hat{p}_{ij} = \frac{\sum_{t=2}^{T} P\{s_t = j, s_{t-1} = i \mid \mathcal{Y}_T; \hat{\theta}\}}{\sum_{t=2}^{T} P\{s_{t-1} = i \mid \mathcal{Y}_T; \hat{\theta}\}},$$ [22.4.16]

where $\hat{\theta}$ denotes the full vector of maximum likelihood estimates. Thus, the estimated
transition probability $\hat{p}_{ij}$ is essentially the number of times state $i$ seems to
have been followed by state $j$ divided by the number of times the process was in
state $i$. These counts are estimated on the basis of the smoothed probabilities.

If the vector of initial probabilities $\rho$ is regarded as a separate vector of
parameters constrained only by $\mathbf{1}'\rho = 1$ and $\rho \geq 0$, the maximum likelihood estimate
of $\rho$ turns out to be the smoothed inference about the initial state:

$$\hat{\rho} = \hat{\xi}_{1|T}.$$ [22.4.17]

The maximum likelihood estimate of the vector $\alpha$ that governs the conditional
density [22.4.2] is characterized by

$$\sum_{t=1}^{T} \left( \frac{\partial \log \eta_t}{\partial \alpha'} \right)' \hat{\xi}_{t|T} = 0.$$ [22.4.18]

Here $\eta_t$ is the $(N \times 1)$ vector obtained by vertically stacking the densities in [22.4.2]
for $j = 1, 2, \ldots, N$ and $(\partial \log \eta_t)/\partial \alpha'$ is the $(N \times k)$ matrix of derivatives of the
logs of these densities, where $k$ represents the number of parameters in $\alpha$. For
example, consider a Markov-switching regression model of the form

$$y_t = z_t' \beta_{s_t} + \varepsilon_t,$$ [22.4.19]


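where $\varepsilon_t \sim$ i.i.d. $N(0, \sigma^2)$ and where $z_t$ is a vector of explanatory variables that
could include lagged values of $y$. The coefficient vector for this regression is $\beta_1$
when the process is in regime 1, $\beta_2$ when the process is in regime 2, and so on.
For this example, the vector $\eta_t$ would be

$$\eta_t = \begin{bmatrix}
\dfrac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ \dfrac{-(y_t - z_t'\beta_1)^2}{2\sigma^2} \right\} \\
\vdots \\
\dfrac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ \dfrac{-(y_t - z_t'\beta_N)^2}{2\sigma^2} \right\}
\end{bmatrix},$$

and for $\alpha = (\beta_1', \beta_2', \ldots, \beta_N', \sigma^2)'$, condition [22.4.18] becomes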


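$$\sum_{t=1}^{T} (y_t - z_t'\hat{\beta}_j)\, z_t \cdot P\{s_t = j \mid \mathcal{Y}_T; \hat{\theta}\} = 0 \qquad \text{for } j = 1, 2, \ldots, N$$ [22.4.20]

$$\hat{\sigma}^2 = T^{-1} \sum_{t=1}^{T} \sum_{j=1}^{N} (y_t - z_t'\hat{\beta}_j)^2 \cdot P\{s_t = j \mid \mathcal{Y}_T; \hat{\theta}\}.$$ [22.4.21]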


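Equation [22.4.20] describes $\hat{\beta}_j$ as satisfying a weighted OLS orthogonality condition
where each observation is weighted by the probability that it came from
regime $j$. In particular, the estimate $\hat{\beta}_j$ can be found from an OLS regression of
$\tilde{y}_t(j)$ on $\tilde{z}_t(j)$:

$$\hat{\beta}_j = \left[ \sum_{t=1}^{T} \tilde{z}_t(j)\, \tilde{z}_t(j)' \right]^{-1} \left[ \sum_{t=1}^{T} \tilde{z}_t(j)\, \tilde{y}_t(j) \right],$$ [22.4.22]

where

$$\begin{aligned}
\tilde{y}_t(j) &= y_t \cdot \sqrt{P\{s_t = j \mid \mathcal{Y}_T; \hat{\theta}\}} \\
\tilde{z}_t(j) &= z_t \cdot \sqrt{P\{s_t = j \mid \mathcal{Y}_T; \hat{\theta}\}}.
\end{aligned}$$ [22.4.23]

The estimate of $\sigma^2$ in [22.4.21] is just $(1/T)$ times the combined sum of the squared
residuals from these $N$ different regressions.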
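The weighted regression in [22.4.22] and [22.4.23] can be sketched as below for a single regime $j$. The function name `weighted_ols`, the simulated data, and the weights (which stand in for smoothed probabilities) are all assumptions made for the illustration.

```python
import numpy as np

def weighted_ols(y, Z, w):
    """OLS of sqrt(w)*y on sqrt(w)*Z, the transformation in [22.4.23].

    y : (T,) dependent variable
    Z : (T, k) explanatory variables, one row per date
    w : (T,) weights standing in for the smoothed probabilities of regime j
    """
    sw = np.sqrt(w)
    y_tilde = y * sw
    Z_tilde = Z * sw[:, None]
    beta, *_ = np.linalg.lstsq(Z_tilde, y_tilde, rcond=None)
    return beta

# Simulated, purely hypothetical data.
rng = np.random.default_rng(0)
T = 50
Z = np.column_stack([np.ones(T), rng.normal(size=T)])
y = Z @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=T)
w = rng.uniform(0.1, 1.0, size=T)
beta_hat = weighted_ols(y, Z, w)

# The fitted coefficients satisfy the orthogonality condition [22.4.20].
score = Z.T @ (w * (y - Z @ beta_hat))
```

The vector `score` is zero up to rounding error, which is exactly the weighted orthogonality condition that characterizes the estimate.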

Again, this suggests an appealing algorithm for finding maximum likelihood
estimates. For the case when $\rho$ is fixed a priori, given an initial guess for the
parameter vector $\theta^{(0)}$ one could evaluate [22.4.16], [22.4.22], and [22.4.21] to
generate a new estimate $\theta^{(1)}$. One then iterates in the same fashion described in
equations [22.3.8] through [22.3.10] to calculate $\theta^{(2)}, \theta^{(3)}, \ldots$. This again turns
out to be an application of the EM algorithm. Alternatively, if $\rho$ is to be estimated
by maximum likelihood, equation [22.4.17] would be added to the equations that
are reevaluated with each iteration. See Hamilton (1990) for details.




Illustration: The Behavior of U.S. Real GNP 


As an illustration of this method, consider the data on U.S. real GNP growth
analyzed in Hamilton (1989). These data are plotted in the bottom panel of Figure
22.4. The following switching model was fitted to these data by maximum likelihood:

$$\begin{aligned}
y_t - \mu_{s_t^*} = {} & \phi_1 (y_{t-1} - \mu_{s_{t-1}^*}) + \phi_2 (y_{t-2} - \mu_{s_{t-2}^*}) \\
& + \phi_3 (y_{t-3} - \mu_{s_{t-3}^*}) + \phi_4 (y_{t-4} - \mu_{s_{t-4}^*}) + \varepsilon_t,
\end{aligned}$$ [22.4.24]

with $\varepsilon_t \sim$ i.i.d. $N(0, \sigma^2)$ and with $s_t^*$ presumed to follow a two-state Markov chain
with transition probabilities $p_{ij}^*$. Maximum likelihood estimates of parameters are
reported in Table 22.1. In the regime represented by $s_t^* = 1$, the average growth
rate is $\mu_1 = 1.2\%$ per quarter, while when $s_t^* = 2$, the average growth rate is
$\mu_2 = -0.4\%$. Each regime is highly persistent. The probability that expansion will
be followed by another quarter of expansion is $p_{11}^* = 0.9$, so that this regime will
persist on average for $1/(1 - p_{11}^*) = 10$ quarters. The probability that a contraction
will be followed by contraction is $p_{22}^* = 0.75$, which episodes will typically persist
for $1/(1 - p_{22}^*) = 4$ quarters.
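The expected-duration arithmetic is easy to reproduce. The helper name `expected_duration` below is hypothetical; the two probabilities are the rounded values quoted in the text.

```python
def expected_duration(p_stay):
    """Expected number of periods a regime lasts when it persists with probability p_stay."""
    return 1.0 / (1.0 - p_stay)

expansion_quarters = expected_duration(0.90)    # about 10 quarters
contraction_quarters = expected_duration(0.75)  # 4 quarters
```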


FIGURE 22.4 Output growth and recession probabilities. Panel (a): probability that the economy is in the contraction state, $P\{s_t^* = 2 \mid y_t, y_{t-1}, \ldots, y_{-4}; \hat{\theta}\}$, plotted as a function of $t$. Panel (b): quarterly rate of growth of U.S. real GNP, 1952-84.


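TABLE 22.1
Maximum Likelihood Estimates of Parameters for Markov-Switching Model
of U.S. GNP (Standard Errors in Parentheses)

Parameter            Estimate    Standard error
$\hat{\mu}_1$          1.16       (0.07)
$\hat{\mu}_2$         -0.36       (0.26)
$\hat{p}_{11}^*$       0.90       (0.04)
$\hat{p}_{22}^*$       0.75       (0.10)
$\hat{\sigma}^2$       0.59       (0.10)
$\hat{\phi}_1$         0.01       (0.12)
$\hat{\phi}_2$        -0.06       (0.14)
$\hat{\phi}_3$        -0.25       (0.11)
$\hat{\phi}_4$        -0.21       (0.11)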

In order to write [22.4.24] in a form where $y_t$ depends only on the current
value of the regime, a variable $s_t$ was defined that takes on one of 32 different
values representing the 32 possible combinations for $s_t^*, s_{t-1}^*, \ldots, s_{t-4}^*$. For example,
$s_t = 1$ when $s_t^*, s_{t-1}^*, \ldots$, and $s_{t-4}^*$ all equal 1, $s_t = 2$ when $s_t^* = 2$ and
$s_{t-1}^* = \cdots = s_{t-4}^* = 1$, and so on. The vector $\hat{\xi}_{t|t}$ calculated from [22.4.5] is thus
a $(32 \times 1)$ vector that contains the probabilities of each of these 32 joint events
conditional on data observed through date $t$.

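The inference about the value of $s_t^*$ for a single date $t$ is obtained by summing
together the relevant joint probabilities. For example, the inference

$$\begin{aligned}
& P\{s_t^* = 2 \mid y_t, y_{t-1}, \ldots, y_{-4}; \hat{\theta}\} \\
& \quad = \sum_{i_1=1}^{2} \sum_{i_2=1}^{2} \sum_{i_3=1}^{2} \sum_{i_4=1}^{2} P\{s_t^* = 2, s_{t-1}^* = i_1, s_{t-2}^* = i_2, s_{t-3}^* = i_3, s_{t-4}^* = i_4 \mid y_t, y_{t-1}, \ldots, y_{-4}; \hat{\theta}\}
\end{aligned}$$ [22.4.25]

is obtained by iterating on [22.4.5] and [22.4.6] with $\theta$ equal to the maximum
likelihood estimate $\hat{\theta}$. One then sums together the elements in the even-numbered
rows of $\hat{\xi}_{t|t}$ to obtain $P\{s_t^* = 2 \mid y_t, y_{t-1}, \ldots, y_{-4}; \hat{\theta}\}$.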
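The bookkeeping behind the 32-state representation can be sketched as follows. The text fixes only the first two values of $s_t$; the ordering used here, in which the current regime $s_t^*$ varies fastest, is an assumption that is consistent with those two values and with the even-numbered-rows rule. The probability vector is a made-up uniform placeholder.

```python
from itertools import product

import numpy as np

# Each tuple is (s*_{t-4}, s*_{t-3}, s*_{t-2}, s*_{t-1}, s*_t); the last entry
# (the current regime) varies fastest, so s_t = 1 is all ones and s_t = 2 has
# only the current regime equal to 2.  This ordering is an assumption.
states = list(product((1, 2), repeat=5))

# Hypothetical (32,) vector of joint probabilities; uniform for illustration.
xi = np.full(32, 1.0 / 32)

# Even-numbered rows in 1-based counting are indices 1, 3, 5, ... in Python.
prob_contraction = xi[1::2].sum()
```

Under this ordering every even-numbered row corresponds to a combination in which the current regime is 2, which is why summing those rows gives the marginal probability.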

A probabilistic inference in the form of [22.4.25] can be calculated for each
date $t$ in the sample. The resulting series is plotted as a function of $t$ in panel (a)
of Figure 22.4. The vertical lines in the figure indicate the dates at which economic
recessions were determined to begin and end according to the National Bureau of
Economic Research. These determinations are made informally on the basis of a
large number of time series and are usually made some time after the event.
Although these business cycle dates were not used in any way to estimate parameters
or form inferences about $s_t^*$, it is interesting that the traditional business cycle
dates correspond fairly closely to the expansion and contraction phases as described
by the model in [22.4.24].


Determining the Number of States 


One of the most important hypotheses that one would want to test for such
models concerns the number of different regimes $N$ that characterize the data.
Unfortunately, this hypothesis cannot be tested using the usual likelihood ratio
test. One of the regularity conditions for the likelihood ratio test to have an asymptotic
$\chi^2$ distribution is that the information matrix $\mathcal{I}$ be nonsingular. This condition
fails to hold if the analyst tries to fit an $N$-state model when the true process has
$N - 1$ states, since under the null hypothesis the parameters that describe the $N$th
state are unidentified. Tests that get around the problems with the regularity conditions
have been proposed by Davies (1977), Hansen (1993), Andrews and Ploberger
(1992), and Stinchcombe and White (1993). Another approach is to take
the $(N - 1)$-state model as the null and conduct a variety of tests of the validity
of that specification as one way of seeing whether an $N$-state model is needed;
Hamilton (1993a) proposed a number of such tests. Studies that illustrate the use
of such tests include Engel and Hamilton (1990), Hansen (1992), and Goodwin
(1993).


APPENDIX 22.A. Derivation of Selected Equations for Chapter 22 


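■ Derivation of [22.3.8] through [22.3.10]. The maximum likelihood estimates are obtained
by forming the Lagrangean

$$J(\theta) = \mathcal{L}(\theta) + \lambda(1 - \pi_1 - \pi_2 - \cdots - \pi_N)$$ [22.A.1]

and setting the derivative with respect to $\theta$ equal to zero. From [22.3.6], the derivative of
the log likelihood is given by

$$\frac{\partial \mathcal{L}(\theta)}{\partial \theta} = \sum_{t=1}^{T} \frac{1}{f(y_t; \theta)} \times \frac{\partial f(y_t; \theta)}{\partial \theta}.$$ [22.A.2]

Observe from [22.3.5] that

$$\frac{\partial f(y_t; \theta)}{\partial \pi_j} = \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left\{ \frac{-(y_t - \mu_j)^2}{2\sigma_j^2} \right\} = f(y_t \mid s_t = j; \theta),$$ [22.A.3]

while

$$\frac{\partial f(y_t; \theta)}{\partial \mu_j} = \frac{y_t - \mu_j}{\sigma_j^2} \times p(y_t, s_t = j; \theta)$$ [22.A.4]

and

$$\frac{\partial f(y_t; \theta)}{\partial \sigma_j^2} = \left\{ -\frac{1}{2\sigma_j^2} + \frac{(y_t - \mu_j)^2}{2\sigma_j^4} \right\} \times p(y_t, s_t = j; \theta).$$ [22.A.5]

Thus, [22.A.2] becomes

$$\frac{\partial \mathcal{L}(\theta)}{\partial \pi_j} = \sum_{t=1}^{T} \frac{1}{f(y_t; \theta)}\, f(y_t \mid s_t = j; \theta)$$ [22.A.6]

$$\frac{\partial \mathcal{L}(\theta)}{\partial \mu_j} = \sum_{t=1}^{T} \frac{1}{f(y_t; \theta)}\, \frac{y_t - \mu_j}{\sigma_j^2}\, p(y_t, s_t = j; \theta)$$ [22.A.7]

$$\frac{\partial \mathcal{L}(\theta)}{\partial \sigma_j^2} = \sum_{t=1}^{T} \frac{1}{f(y_t; \theta)} \left\{ -\frac{1}{2\sigma_j^2} + \frac{(y_t - \mu_j)^2}{2\sigma_j^4} \right\} p(y_t, s_t = j; \theta).$$ [22.A.8]

Recalling [22.3.7], the derivatives in [22.A.6] through [22.A.8] can be written

$$\frac{\partial \mathcal{L}(\theta)}{\partial \pi_j} = \pi_j^{-1} \sum_{t=1}^{T} P\{s_t = j \mid y_t; \theta\}$$ [22.A.9]

$$\frac{\partial \mathcal{L}(\theta)}{\partial \mu_j} = \sum_{t=1}^{T} \frac{y_t - \mu_j}{\sigma_j^2}\, P\{s_t = j \mid y_t; \theta\}$$ [22.A.10]

$$\frac{\partial \mathcal{L}(\theta)}{\partial \sigma_j^2} = \sum_{t=1}^{T} \left\{ -\frac{1}{2\sigma_j^2} + \frac{(y_t - \mu_j)^2}{2\sigma_j^4} \right\} P\{s_t = j \mid y_t; \theta\}.$$ [22.A.11]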


Setting the derivative of the Lagrangean in [22.A.1] with respect to $\mu_j$ equal to zero
means setting [22.A.10] equal to zero, from which

$$\sum_{t=1}^{T} y_t \cdot P\{s_t = j \mid y_t; \hat{\theta}\} = \hat{\mu}_j \sum_{t=1}^{T} P\{s_t = j \mid y_t; \hat{\theta}\}.$$

Equation [22.3.8] follows immediately from this condition. Similarly, the first-order conditions
for maximization with respect to $\sigma_j^2$ are found by setting [22.A.11] equal to zero:

$$\sum_{t=1}^{T} \left\{ -\hat{\sigma}_j^{-2} + (y_t - \hat{\mu}_j)^2 \hat{\sigma}_j^{-4} \right\} P\{s_t = j \mid y_t; \hat{\theta}\} = 0,$$

from which [22.3.9] follows. Finally, from [22.A.9], the derivative of [22.A.1] with respect
to $\pi_j$ is given by

$$\frac{\partial J(\theta)}{\partial \pi_j} = \pi_j^{-1} \sum_{t=1}^{T} P\{s_t = j \mid y_t; \hat{\theta}\} - \lambda = 0,$$

from which

$$\sum_{t=1}^{T} P\{s_t = j \mid y_t; \hat{\theta}\} = \lambda \hat{\pi}_j.$$ [22.A.12]

Summing [22.A.12] over $j = 1, 2, \ldots, N$ produces

$$\sum_{t=1}^{T} \left[ P\{s_t = 1 \mid y_t; \hat{\theta}\} + \cdots + P\{s_t = N \mid y_t; \hat{\theta}\} \right] = \lambda(\hat{\pi}_1 + \hat{\pi}_2 + \cdots + \hat{\pi}_N)$$

or

$$\sum_{t=1}^{T} \{1\} = \lambda \cdot (1),$$

implying that $T = \lambda$. Replacing $\lambda$ with $T$ in [22.A.12] produces [22.3.10]. ■
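The first-order conditions just derived suggest a simple fixed-point iteration for a mixture of normals: compute the regime probabilities from the current parameter values, then update the means, variances, and mixing proportions as probability-weighted averages. The sketch below illustrates one such update; the function name `em_step`, the simulated data, and the starting values are assumptions made for the example.

```python
import numpy as np

def em_step(y, pi, mu, sigma2):
    """One pass through the first-order conditions for an i.i.d. normal mixture."""
    # (T, N) array of regime-specific normal densities.
    dens = np.exp(-(y[:, None] - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    joint = pi * dens
    post = joint / joint.sum(axis=1, keepdims=True)   # P{s_t = j | y_t}
    w = post.sum(axis=0)
    mu_new = (post * y[:, None]).sum(axis=0) / w                      # weighted mean
    sigma2_new = (post * (y[:, None] - mu_new) ** 2).sum(axis=0) / w  # weighted variance
    pi_new = w / len(y)                                               # average probability
    return pi_new, mu_new, sigma2_new

# Simulated, purely hypothetical data from two well-separated regimes.
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(6.0, 1.0, 200)])
pi, mu, sigma2 = np.array([0.5, 0.5]), np.array([-1.0, 5.0]), np.array([1.0, 1.0])
for _ in range(50):
    pi, mu, sigma2 = em_step(y, pi, mu, sigma2)
```

Because the updated proportions are averages of probabilities that sum to unity at each date, they sum to unity automatically, which mirrors the result $\lambda = T$ above.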


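■ Derivation of [22.4.14]. Recall first that under the maintained assumptions, the regime
$s_t$ depends on past observations $\mathcal{Y}_{t-1}$ only through the value of $s_{t-1}$. Similarly, $s_t$ depends
on future observations only through the value of $s_{t+1}$:

$$P\{s_t = j \mid s_{t+1} = i, \mathcal{Y}_T; \theta\} = P\{s_t = j \mid s_{t+1} = i, \mathcal{Y}_t; \theta\}.$$ [22.A.13]

The validity of [22.A.13] is formally established as follows (the implicit dependence
on $\theta$ will be suppressed to simplify the notation). Observe that

$$\begin{aligned}
P\{s_t = j \mid s_{t+1} = i, \mathcal{Y}_{t+1}\}
&= P\{s_t = j \mid s_{t+1} = i, y_{t+1}, x_{t+1}, \mathcal{Y}_t\} \\
&= \frac{p(y_{t+1}, s_t = j \mid s_{t+1} = i, x_{t+1}, \mathcal{Y}_t)}{f(y_{t+1} \mid s_{t+1} = i, x_{t+1}, \mathcal{Y}_t)} \\
&= \frac{f(y_{t+1} \mid s_t = j, s_{t+1} = i, x_{t+1}, \mathcal{Y}_t) \cdot P\{s_t = j \mid s_{t+1} = i, x_{t+1}, \mathcal{Y}_t\}}{f(y_{t+1} \mid s_{t+1} = i, x_{t+1}, \mathcal{Y}_t)},
\end{aligned}$$ [22.A.14]

which simplifies to

$$P\{s_t = j \mid s_{t+1} = i, \mathcal{Y}_{t+1}\} = P\{s_t = j \mid s_{t+1} = i, x_{t+1}, \mathcal{Y}_t\},$$ [22.A.15]

provided that

$$f(y_{t+1} \mid s_t = j, s_{t+1} = i, x_{t+1}, \mathcal{Y}_t) = f(y_{t+1} \mid s_{t+1} = i, x_{t+1}, \mathcal{Y}_t),$$ [22.A.16]

which is indeed the case, since the specification assumes that $y_{t+1}$ depends on $\{s_{t+1},
s_t, \ldots\}$ only through the current value $s_{t+1}$. Since $x$ is exogenous, [22.A.15] further implies
that

$$P\{s_t = j \mid s_{t+1} = i, \mathcal{Y}_{t+1}\} = P\{s_t = j \mid s_{t+1} = i, \mathcal{Y}_t\}.$$ [22.A.17]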
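By similar reasoning, it must be the case that

$$\begin{aligned}
P\{s_t = j \mid s_{t+1} = i, \mathcal{Y}_{t+2}\}
&= P\{s_t = j \mid s_{t+1} = i, y_{t+2}, x_{t+2}, \mathcal{Y}_{t+1}\} \\
&= \frac{p(y_{t+2}, s_t = j \mid s_{t+1} = i, x_{t+2}, \mathcal{Y}_{t+1})}{f(y_{t+2} \mid s_{t+1} = i, x_{t+2}, \mathcal{Y}_{t+1})} \\
&= \frac{f(y_{t+2} \mid s_t = j, s_{t+1} = i, x_{t+2}, \mathcal{Y}_{t+1}) \cdot P\{s_t = j \mid s_{t+1} = i, x_{t+2}, \mathcal{Y}_{t+1}\}}{f(y_{t+2} \mid s_{t+1} = i, x_{t+2}, \mathcal{Y}_{t+1})},
\end{aligned}$$

which simplifies to

$$P\{s_t = j \mid s_{t+1} = i, \mathcal{Y}_{t+2}\} = P\{s_t = j \mid s_{t+1} = i, x_{t+2}, \mathcal{Y}_{t+1}\},$$ [22.A.18]

provided that

$$f(y_{t+2} \mid s_t = j, s_{t+1} = i, x_{t+2}, \mathcal{Y}_{t+1}) = f(y_{t+2} \mid s_{t+1} = i, x_{t+2}, \mathcal{Y}_{t+1}).$$ [22.A.19]

In this case, [22.A.19] is established from the fact that

$$\begin{aligned}
f(y_{t+2} \mid s_t = j, s_{t+1} = i, x_{t+2}, \mathcal{Y}_{t+1})
&= \sum_{k=1}^{N} p(y_{t+2}, s_{t+2} = k \mid s_t = j, s_{t+1} = i, x_{t+2}, \mathcal{Y}_{t+1}) \\
&= \sum_{k=1}^{N} \big[ f(y_{t+2} \mid s_{t+2} = k, s_t = j, s_{t+1} = i, x_{t+2}, \mathcal{Y}_{t+1}) \\
&\qquad\quad \times P\{s_{t+2} = k \mid s_t = j, s_{t+1} = i, x_{t+2}, \mathcal{Y}_{t+1}\} \big] \\
&= \sum_{k=1}^{N} \big[ f(y_{t+2} \mid s_{t+2} = k, s_{t+1} = i, x_{t+2}, \mathcal{Y}_{t+1}) \\
&\qquad\quad \times P\{s_{t+2} = k \mid s_{t+1} = i, x_{t+2}, \mathcal{Y}_{t+1}\} \big] \\
&= f(y_{t+2} \mid s_{t+1} = i, x_{t+2}, \mathcal{Y}_{t+1}).
\end{aligned}$$

Again, exogeneity of $x$ means that [22.A.18] can be written

$$P\{s_t = j \mid s_{t+1} = i, \mathcal{Y}_{t+2}\} = P\{s_t = j \mid s_{t+1} = i, \mathcal{Y}_{t+1}\} = P\{s_t = j \mid s_{t+1} = i, \mathcal{Y}_t\},$$

where the last equality follows from [22.A.17].

Proceeding inductively, the same argument can be used to establish that

$$P\{s_t = j \mid s_{t+1} = i, \mathcal{Y}_{t+m}\} = P\{s_t = j \mid s_{t+1} = i, \mathcal{Y}_t\}$$

for $m = 1, 2, \ldots$, from which [22.A.13] follows.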
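Note next that

$$\begin{aligned}
P\{s_t = j \mid s_{t+1} = i, \mathcal{Y}_t\}
&= \frac{P\{s_t = j, s_{t+1} = i \mid \mathcal{Y}_t\}}{P\{s_{t+1} = i \mid \mathcal{Y}_t\}} \\
&= \frac{P\{s_t = j \mid \mathcal{Y}_t\} \cdot P\{s_{t+1} = i \mid s_t = j\}}{P\{s_{t+1} = i \mid \mathcal{Y}_t\}} \\
&= \frac{p_{ji} \cdot P\{s_t = j \mid \mathcal{Y}_t\}}{P\{s_{t+1} = i \mid \mathcal{Y}_t\}}.
\end{aligned}$$ [22.A.20]

It is therefore the case that

$$\begin{aligned}
P\{s_t = j, s_{t+1} = i \mid \mathcal{Y}_T\}
&= P\{s_{t+1} = i \mid \mathcal{Y}_T\} \cdot P\{s_t = j \mid s_{t+1} = i, \mathcal{Y}_T\} \\
&= P\{s_{t+1} = i \mid \mathcal{Y}_T\} \cdot P\{s_t = j \mid s_{t+1} = i, \mathcal{Y}_t\} \\
&= P\{s_{t+1} = i \mid \mathcal{Y}_T\} \cdot \frac{p_{ji} \cdot P\{s_t = j \mid \mathcal{Y}_t\}}{P\{s_{t+1} = i \mid \mathcal{Y}_t\}},
\end{aligned}$$ [22.A.21]

where the second equality follows from [22.A.13] and the third follows from [22.A.20].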
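The smoothed inference for date $t$ is the sum of [22.A.21] over $i = 1, 2, \ldots, N$:

$$\begin{aligned}
P\{s_t = j \mid \mathcal{Y}_T\}
&= \sum_{i=1}^{N} P\{s_t = j, s_{t+1} = i \mid \mathcal{Y}_T\} \\
&= \sum_{i=1}^{N} P\{s_{t+1} = i \mid \mathcal{Y}_T\} \cdot \frac{p_{ji} \cdot P\{s_t = j \mid \mathcal{Y}_t\}}{P\{s_{t+1} = i \mid \mathcal{Y}_t\}} \\
&= P\{s_t = j \mid \mathcal{Y}_t\} \sum_{i=1}^{N} \frac{p_{ji} \cdot P\{s_{t+1} = i \mid \mathcal{Y}_T\}}{P\{s_{t+1} = i \mid \mathcal{Y}_t\}} \\
&= P\{s_t = j \mid \mathcal{Y}_t\} \begin{bmatrix} p_{j1} & p_{j2} & \cdots & p_{jN} \end{bmatrix}
\begin{bmatrix}
P\{s_{t+1} = 1 \mid \mathcal{Y}_T\} / P\{s_{t+1} = 1 \mid \mathcal{Y}_t\} \\
P\{s_{t+1} = 2 \mid \mathcal{Y}_T\} / P\{s_{t+1} = 2 \mid \mathcal{Y}_t\} \\
\vdots \\
P\{s_{t+1} = N \mid \mathcal{Y}_T\} / P\{s_{t+1} = N \mid \mathcal{Y}_t\}
\end{bmatrix} \\
&= P\{s_t = j \mid \mathcal{Y}_t\}\, p_j' \left( \hat{\xi}_{t+1|T} \,(\div)\, \hat{\xi}_{t+1|t} \right),
\end{aligned}$$ [22.A.22]

where the $(1 \times N)$ vector $p_j'$ denotes the $j$th row of the matrix $P'$ and the sign $(\div)$ indicates
element-by-element division. When the equations represented by [22.A.22] for $j = 1, 2,
\ldots, N$ are collected in an $(N \times 1)$ vector, the result is

$$\hat{\xi}_{t|T} = \hat{\xi}_{t|t} \odot \{P' \cdot (\hat{\xi}_{t+1|T} \,(\div)\, \hat{\xi}_{t+1|t})\},$$

as claimed. ■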


Chapter 22 Exercise 


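22.1. Let $s_t$ be described by an ergodic two-state Markov chain with transition matrix $P$
given by [22.2.11]. Verify that the matrix of eigenvectors of this matrix is given by

$$T = \begin{bmatrix}
(1 - p_{22})/(2 - p_{11} - p_{22}) & -1 \\
(1 - p_{11})/(2 - p_{11} - p_{22}) & 1
\end{bmatrix}$$

with inverse

$$T^{-1} = \begin{bmatrix}
1 & 1 \\
-(1 - p_{11})/(2 - p_{11} - p_{22}) & (1 - p_{22})/(2 - p_{11} - p_{22})
\end{bmatrix}.$$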
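A numerical spot check can complement the algebraic verification the exercise asks for. The sketch below assumes that the transition matrix in [22.2.11] has columns $(p_{11}, 1 - p_{11})'$ and $(1 - p_{22}, p_{22})'$, and uses made-up values for $p_{11}$ and $p_{22}$; it checks one parameter pair and is not a proof.

```python
import numpy as np

# Hypothetical transition probabilities.
p11, p22 = 0.9, 0.75

# Assumed form of the two-state transition matrix (columns sum to one).
P = np.array([[p11, 1 - p22],
              [1 - p11, p22]])

d = 2 - p11 - p22
T = np.array([[(1 - p22) / d, -1.0],
              [(1 - p11) / d, 1.0]])
T_inv = np.array([[1.0, 1.0],
                  [-(1 - p11) / d, (1 - p22) / d]])

# Eigenvalues of a two-state chain: unity and (p11 + p22 - 1).
Lam = np.diag([1.0, p11 + p22 - 1])

identity_ok = np.allclose(T @ T_inv, np.eye(2))
eigen_ok = np.allclose(P @ T, T @ Lam)
```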


Chapter 22 References 


Andrews, Donald W. K., and Werner Ploberger. 1992. “Optimal Tests When a Nuisance
Parameter Is Present Only under the Alternative.” Yale University. Mimeo.
Cox, D. R., and H. D. Miller. 1965. The Theory of Stochastic Processes. London: Methuen.
Davies, R. B. 1977. “Hypothesis Testing When a Nuisance Parameter Is Present Only under
the Alternative.” Biometrika 64:247-54.
Dempster, A. P., N. M. Laird, and D. B. Rubin. 1977. “Maximum Likelihood from Incomplete
Data via the EM Algorithm.” Journal of the Royal Statistical Society Series B,
39:1-38.
Diebold, Francis X., Joon-Haeng Lee, and Gretchen C. Weinbach. Forthcoming. “Regime
Switching with Time-Varying Transition Probabilities,” in C. Hargreaves, ed., Nonstationary
Time Series Analysis and Cointegration. Oxford: Oxford University Press.
Durland, J. Michael, and Thomas H. McCurdy. 1992. “Modelling Duration Dependence
in Cyclical Data Using a Restricted Semi-Markov Process.” Queen’s University, Kingston,
Ontario. Mimeo.
Engel, Charles, and James D. Hamilton. 1990. “Long Swings in the Dollar: Are They in
the Data and Do Markets Know It?” American Economic Review 80:689-713.
Everitt, B. S., and D. J. Hand. 1981. Finite Mixture Distributions. London: Chapman and
Hall.
Filardo, Andrew J. 1992. “Business Cycle Phases and Their Transitional Dynamics.” Federal
Reserve Bank of Kansas City. Mimeo.
Goodwin, Thomas H. 1993. “Business Cycle Analysis with a Markov-Switching Model.”
Journal of Business and Economic Statistics 11:331-39.
Hamilton, James D. 1989. “A New Approach to the Economic Analysis of Nonstationary
Time Series and the Business Cycle.” Econometrica 57:357-84.
Hamilton, James D. 1990. “Analysis of Time Series Subject to Changes in Regime.” Journal of
Econometrics 45:39-70.
Hamilton, James D. 1991. “A Quasi-Bayesian Approach to Estimating Parameters for Mixtures of
Normal Distributions.” Journal of Business and Economic Statistics 9:27-39.
Hamilton, James D. 1993a. “Specification Testing in Markov-Switching Time Series Models.” University
of California, San Diego. Mimeo.
Hamilton, James D. 1993b. “Estimation, Inference, and Forecasting of Time Series Subject to Changes
in Regime,” in G. S. Maddala, C. R. Rao, and H. D. Vinod, eds., Handbook of Statistics,
Vol. 11. New York: North-Holland.
Hamilton, James D. 1993c. “State-Space Models,” in Robert Engle and Daniel McFadden, eds.,
Handbook of Econometrics, Vol. 4. New York: North-Holland.
Hansen, Bruce E. 1992. “The Likelihood Ratio Test under Non-Standard Conditions: Testing
the Markov Switching Model of GNP.” Journal of Applied Econometrics 7:S61-82.
Hansen, Bruce E. 1993. “Inference When a Nuisance Parameter Is Not Identified under the Null
Hypothesis.” University of Rochester. Mimeo.
Kiefer, Nicholas M. 1978. “Discrete Parameter Variation: Efficient Estimation of a Switching
Regression Model.” Econometrica 46:427-34.
Kim, Chang-Jin. 1993. “Dynamic Linear Models with Markov-Switching.” Journal of
Econometrics, forthcoming.
Lam, Pok-sang. 1990. “The Hamilton Model with a General Autoregressive Component:
Estimation and Comparison with Other Models of Economic Time Series.” Journal of
Monetary Economics 26:409-32.
Rogers, John H. 1992. “The Currency Substitution Hypothesis and Relative Money Demand
in Mexico and Canada.” Journal of Money, Credit, and Banking 24:300-18.
Stinchcombe, Maxwell, and Halbert White. 1993. “An Approach to Consistent Specification
Testing Using Duality and Banach Limit Theory.” University of California, San Diego.
Mimeo.
Titterington, D. M., A. F. M. Smith, and U. E. Makov. 1985. Statistical Analysis of Finite
Mixture Distributions. New York: Wiley.




A 


Mathematical Review 


This book assumes some familiarity with elementary trigonometry, complex num- 
bers, calculus, matrix algebra, and probability. Introductions to the first three topics 
by Chiang (1974) or Thomas (1972) are adequate; Marsden (1974) treated these 
issues in more depth. No matrix algebra is required beyond the level of standard 
econometrics texts such as Theil (1971) or Johnston (1984); for more detailed 
treatments, see O’Nan (1976), Strang (1976), and Magnus and Neudecker (1988). 
The concepts of probability and statistics from standard econometrics texts are also
sufficient for getting through this book; for more complete introductions, see Lindgren
(1976) or Hoel, Port, and Stone (1971).

This appendix reviews the necessary mathematical concepts and results. The 
reader familiar with these topics is invited to skip this material, or consult sub- 
headings for desired coverage. 


A.1. Trigonometry 


Definitions 
Figure A.1 displays a circle with unit radius centered at the origin in $(x, y)$-space.
Let $(x_0, y_0)$ denote some point on this unit circle, and consider the angle $\theta$
between this point and the $x$-axis. The sine of $\theta$ is defined as the $y$-coordinate of
the point, and the cosine is the $x$-coordinate:

$$\sin(\theta) = y_0$$ [A.1.1]

$$\cos(\theta) = x_0.$$ [A.1.2]

This text always measures angles in radians. The radian measure of the angle $\theta$
is defined as the distance traveled counterclockwise along the unit circle starting
at the $x$-axis before reaching $(x_0, y_0)$. The circumference of a circle with unit radius
is $2\pi$. A rotation one-quarter of the way around the unit circle would therefore
correspond to radian measure of $\theta = \tfrac{1}{4}(2\pi) = \pi/2$. An angle whose radian measure
is $\pi/2$ is more commonly described as a right angle or a 90° angle. A 45° angle has
radian measure of $\pi/4$, a 180° angle has radian measure of $\pi$, and so on.


Polar Coordinates 


FIGURE A.1 Trigonometric functions as distances in $(x, y)$-space.

Consider a smaller triangle, say, the triangle with vertex $(x_1, y_1)$ shown in
Figure A.1, that shares the same angle $\theta$ as the original triangle with vertex
$(x_0, y_0)$. The ratio of any two sides of such a smaller triangle will be the same as
that for the larger triangle:

$$y_1/c_1 = y_0/1$$ [A.1.3]

$$x_1/c_1 = x_0/1.$$ [A.1.4]


Comparing [A.1.3] with [A.1.1], the $y$-coordinate of any point such as $(x_1, y_1)$ in
$(x, y)$-space may be expressed as

$$y_1 = c_1 \cdot \sin(\theta),$$ [A.1.5]

where $c_1$ is the distance from the origin to $(x_1, y_1)$ and $\theta$ is the angle that the point
$(x_1, y_1)$ makes with the $x$-axis. Comparing [A.1.4] with [A.1.2], the $x$-coordinate
of $(x_1, y_1)$ can be expressed as

$$x_1 = c_1 \cdot \cos(\theta).$$ [A.1.6]

Recall further that the magnitude $c_1$, which represents the distance from the origin
to the point $(x_1, y_1)$, is given by the formula

$$c_1 = \sqrt{x_1^2 + y_1^2}.$$ [A.1.7]

Taking a point in $(x, y)$-space and writing it as $(c \cdot \cos(\theta), c \cdot \sin(\theta))$ is called
describing the point in terms of its polar coordinates $c$ and $\theta$.
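The conversion between rectangular and polar coordinates can be illustrated with a short sketch. The function names `to_polar` and `from_polar` are hypothetical; `math.hypot` and `math.atan2` are standard-library functions that return the distance in [A.1.7] and the angle in radians.

```python
import math

def to_polar(x, y):
    """Return (c, theta): distance from the origin and angle with the x-axis."""
    c = math.hypot(x, y)        # sqrt(x^2 + y^2), as in [A.1.7]
    theta = math.atan2(y, x)    # angle in radians, between -pi and pi
    return c, theta

def from_polar(c, theta):
    """Return (x, y) from [A.1.6] and [A.1.5]."""
    return c * math.cos(theta), c * math.sin(theta)

c, theta = to_polar(3.0, 4.0)
x, y = from_polar(c, theta)
```

For the point (3, 4) the distance from the origin is 5, and converting back recovers the original coordinates up to rounding error.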




Properties of Sine and Cosine Functions 


The functions $\sin(\theta)$ and $\cos(\theta)$ are called trigonometric or sinusoidal functions.
Viewed as a function of $\theta$, the sine function starts out at zero:

$$\sin(0) = 0.$$

The sine function rises to 1 as $\theta$ increases to $\pi/2$ and then falls back to zero as $\theta$
increases further to $\pi$; see panel (a) of Figure A.2. The function reaches its minimum
value of $-1$ at $\theta = 3\pi/2$ and then begins climbing back up.

If we travel a distance of $2\pi$ radians around the unit circle, we are right back
where we started, and the function repeats itself:

$$\sin(2\pi + \theta) = \sin(\theta).$$

The function would again repeat itself if we made two full revolutions around the
unit circle. Indeed for any integer $j$,

$$\sin(2\pi j + \theta) = \sin(\theta).$$ [A.1.8]


FIGURE A.2 Sine and cosine functions. Panel (a): $\sin(\theta)$. Panel (b): $\cos(\theta)$.


The sine function is thus periodic and is for this reason often useful for describing
a time series that repeats itself in a particular cycle.

The cosine function starts out at unity and falls to zero as $\theta$ increases to
$\pi/2$; see panel (b) of Figure A.2. It turns out simply to be a horizontal shift of the
sine function:

$$\cos(\theta) = \sin\left(\theta + \frac{\pi}{2}\right).$$ [A.1.9]

The sine or cosine function can also be evaluated for negative values of $\theta$,
defined as a clockwise rotation around the unit circle from the $x$-axis. Clearly,

$$\sin(-\theta) = -\sin(\theta)$$ [A.1.10]

$$\cos(-\theta) = \cos(\theta).$$ [A.1.11]

For $(x_0, y_0)$ a point on the unit circle, [A.1.7] implies that

$$1 = \sqrt{x_0^2 + y_0^2},$$

or, squaring both sides and using [A.1.1] and [A.1.2],

$$1 = [\cos(\theta)]^2 + [\sin(\theta)]^2.$$ [A.1.12]


Using Trigonometric Functions to Represent Cycles 


Suppose we construct the function $g(\theta)$ by first multiplying $\theta$ by 2 and then
evaluating the sine of the product:

$$g(\theta) = \sin(2\theta).$$

This doubles the frequency at which the function cycles. When $\theta$ goes from 0 to
$\pi$, $2\theta$ goes from 0 to $2\pi$, and so $g(\theta)$ is back to its original value (see Figure A.3).
In general, the function $\sin(k\theta)$ would go through $k$ cycles in the time it takes $\sin(\theta)$
to complete a single cycle.

We will sometimes describe the value a variable $y$ takes on at date $t$ as a
function of sines or cosines, such as

$$y_t = R \cdot \cos(\omega t + \alpha).$$ [A.1.13]


FIGURE A.3 Effect of changing frequency of a periodic function. The curve plotted is $\sin(2\theta)$.


The parameter $R$ gives the amplitude of [A.1.13]. The variable $y_t$ will attain a
maximum value of $+R$ and a minimum value of $-R$. The parameter $\alpha$ is the phase.
The phase determines where in the cycle $y_t$ would be at $t = 0$. The parameter $\omega$
governs how quickly the variable cycles, which can be summarized by either of
two measures. The period is the length of time required for the process to repeat
a full cycle. The period of [A.1.13] is $2\pi/\omega$. For example, if $\omega = 1$ then $y$ repeats
itself every $2\pi$ periods, whereas if $\omega = 2$ the process repeats itself every $\pi$ periods.
The frequency summarizes how frequently the process cycles compared with the
simple function $\cos(t)$; thus, it measures the number of cycles completed during
$2\pi$ periods. The frequency of $\cos(t)$ is unity, and the frequency of [A.1.13] is $\omega$.
For example, if $\omega = 2$, the cycles are completed twice as quickly as those for
$\cos(t)$. There is a simple relation between these two measures of the speed of
cycles: the period is equal to $2\pi$ divided by the frequency.
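The relation between period and frequency can be checked numerically. In the sketch below the function names `cycle` and `period` and the parameter values are hypothetical.

```python
import math

def cycle(t, R=1.0, omega=2.0, alpha=0.5):
    """Evaluate R * cos(omega * t + alpha), the form in [A.1.13]."""
    return R * math.cos(omega * t + alpha)

def period(omega):
    """Length of time needed to complete one full cycle."""
    return 2 * math.pi / omega

t0 = 0.3
same_value = math.isclose(cycle(t0), cycle(t0 + period(2.0)), abs_tol=1e-12)
```

With frequency 2 the period is $\pi$, so the function takes the same value at `t0` and at `t0` plus $\pi$.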


A.2. Complex Numbers 


Definitions 
Consider the following expression:

$$x^2 = 1.$$ [A.2.1]

There are two values of $x$ that satisfy [A.2.1], namely, $x = 1$ and $x = -1$.
Suppose instead that we were given the following equation:

$$x^2 = -1.$$ [A.2.2]

No real number satisfies [A.2.2]. However, let us consider an imaginary number
(denoted $i$) that does:

$$i^2 = -1.$$ [A.2.3]

We assume that $i$ can be multiplied by a real number and manipulated using standard
rules of algebra. For example,

$$2i + 3i = 5i$$

and

$$(2i) \cdot (3i) = (6)i^2 = -6.$$

This last property implies that a second solution to [A.2.2] is given by $x = -i$:

$$(-i)^2 = (-1)^2 (i)^2 = -1.$$

Thus, [A.2.1] has two real roots ($+1$ and $-1$), whereas [A.2.2] has two imaginary
roots ($i$ and $-i$).

For any real numbers $a$ and $b$, we can construct the expression

$$a + bi.$$ [A.2.4]

If $b = 0$, then [A.2.4] is a real number; whereas if $a = 0$ and $b$ is nonzero, then
[A.2.4] is an imaginary number. A number written in the general form of [A.2.4]
is called a complex number.


Rules for Manipulating Complex Numbers 


Complex numbers are manipulated using standard rules of algebra. Two
complex numbers are added as follows:

$$(a_1 + b_1 i) + (a_2 + b_2 i) = (a_1 + a_2) + (b_1 + b_2)i.$$

Complex numbers are multiplied this way:

$$\begin{aligned}
(a_1 + b_1 i) \cdot (a_2 + b_2 i) &= a_1 a_2 + a_1 b_2 i + b_1 a_2 i + b_1 b_2 i^2 \\
&= (a_1 a_2 - b_1 b_2) + (a_1 b_2 + b_1 a_2)i.
\end{aligned}$$

Note that the resulting expressions are always simplified by separating the real
component (such as $[a_1 a_2 - b_1 b_2]$) from the imaginary component (such as
$[a_1 b_2 + b_1 a_2]i$).
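The multiplication rule can be written out directly on pairs of real numbers and compared with a language's built-in complex arithmetic. In this sketch the function names `add` and `multiply` and the representation of $a + bi$ as the tuple `(a, b)` are illustrative choices.

```python
def add(z1, z2):
    """(a1 + b1 i) + (a2 + b2 i), with each number stored as a tuple (a, b)."""
    (a1, b1), (a2, b2) = z1, z2
    return (a1 + a2, b1 + b2)

def multiply(z1, z2):
    """(a1 + b1 i)(a2 + b2 i) = (a1 a2 - b1 b2) + (a1 b2 + b1 a2) i."""
    (a1, b1), (a2, b2) = z1, z2
    return (a1 * a2 - b1 * b2, a1 * b2 + b1 * a2)

product = multiply((2, 3), (4, -1))    # (2 + 3i)(4 - i)
builtin = complex(2, 3) * complex(4, -1)
imaginary_product = multiply((0, 2), (0, 3))   # (2i)(3i)
```

The second example reproduces the calculation $(2i) \cdot (3i) = -6$ from the text.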


Graphical Representation of Complex Numbers 


A complex number (a + bi) is sometimes represented graphically in an
Argand diagram as in Figure A.4. The value of the real component (a) is plotted
on the horizontal axis, and the imaginary component (b) is plotted on the vertical
axis. The size, or modulus, of a complex number is measured the same way as the
distance from the origin of a real element in (x, y)-space (see equation [A.1.7]):

$$|a + bi| = \sqrt{a^2 + b^2}.$$ [A.2.5]


The complex unit circle is the set of all complex numbers whose modulus is
1. For example, the real number +1 is on the complex unit circle (represented by
the point A in Figure A.4). So are the imaginary number -i (point B) and the
complex number (-0.6 - 0.8i) (point C).

FIGURE A.4  Argand diagram and the complex unit circle. [Figure not reproduced: the vertical axis is the imaginary axis, and a point (a, b) lies at distance $(a^2 + b^2)^{1/2}$ from the origin.]

We will often be interested in whether a complex number is less than 1 in 
modulus, in which case the number is said to be inside the unit circle. For example, 
(-0.3 + 0.4i) has modulus 0.5, so it lies inside the unit circle, whereas (3 + 4i),
with modulus 5, lies outside the unit circle. 
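
The modulus check in [A.2.5] is easy to reproduce numerically. The following is a minimal sketch using Python's built-in complex type; the helper names `modulus` and `inside_unit_circle` are illustrative and not part of the text.

```python
import math

def modulus(z: complex) -> float:
    """Modulus |a + bi| = sqrt(a^2 + b^2), as in [A.2.5]."""
    return math.hypot(z.real, z.imag)

def inside_unit_circle(z: complex) -> bool:
    """True when the modulus is strictly less than 1."""
    return modulus(z) < 1.0

# The two examples from the text: one inside, one outside the unit circle.
for z in (-0.3 + 0.4j, 3 + 4j):
    print(z, modulus(z), inside_unit_circle(z))
```

For points that lie exactly on the unit circle, such as (-0.6 - 0.8i), floating-point rounding can push the computed modulus slightly above or below 1, so a tolerance is advisable in practice.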


Polar Coordinates 


Just as a point in (x, y)-space can be represented by its distance c from the
origin and its angle $\theta$ with the x-axis, the complex number a + bi can be represented
by the distance of (a, b) from the origin (the modulus of the complex number),

$$R = \sqrt{a^2 + b^2},$$
and by the angle $\theta$ that the point (a, b) makes with the real axis, characterized by

$$\cos(\theta) = a/R$$
$$\sin(\theta) = b/R.$$

Thus, the complex number a + bi is written in polar coordinate form as

$$[R\cdot\cos(\theta) + i\cdot R\cdot\sin(\theta)] = R[\cos(\theta) + i\cdot\sin(\theta)].$$ [A.2.6]


Complex Conjugates 


The complex conjugate of (a + bi) is given by (a — bi). The numbers (a + 
bi) and (a — bi) are described as a conjugate pair. Notice that adding a conjugate 
pair produces a real result: 


(a + bi) + (a — bi) = 2a. 
The product of a conjugate pair is also real: 
$$(a + bi)\cdot(a - bi) = a^2 + b^2.$$ [A.2.7]

Comparing this with [A.2.5], we see that the modulus of a complex number (a +
bi) can be thought of as the square root of the product of the number with its
complex conjugate:

$$|a + bi| = \sqrt{(a + bi)(a - bi)}.$$ [A.2.8]


Quadratic Equations 
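A quadratic equation
$$\alpha x^2 + \beta x + \gamma = 0$$ [A.2.9]
with $\alpha \neq 0$ has two solutions:
$$x_1 = \frac{-\beta + (\beta^2 - 4\alpha\gamma)^{1/2}}{2\alpha}$$ [A.2.10]
$$x_2 = \frac{-\beta - (\beta^2 - 4\alpha\gamma)^{1/2}}{2\alpha}.$$ [A.2.11]

When $(\beta^2 - 4\alpha\gamma) \geq 0$, both these roots are real, whereas when $(\beta^2 - 4\alpha\gamma) < 0$,
the roots are complex. Notice that when the roots are complex they appear as a
conjugate pair:
$$x_1 = \{-\beta/[2\alpha]\} + \{(1/[2\alpha])(4\alpha\gamma - \beta^2)^{1/2}\} i$$
$$x_2 = \{-\beta/[2\alpha]\} - \{(1/[2\alpha])(4\alpha\gamma - \beta^2)^{1/2}\} i.$$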
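
The formulas [A.2.10] and [A.2.11] can be evaluated directly. The sketch below is an illustration only; the function name `quadratic_roots` is an assumption, and `cmath.sqrt` is used so that a negative discriminant yields the complex conjugate pair rather than an error.

```python
import cmath

def quadratic_roots(alpha: float, beta: float, gamma: float):
    """Return (x1, x2) from [A.2.10] and [A.2.11] as complex numbers."""
    if alpha == 0:
        raise ValueError("alpha must be nonzero for a quadratic equation")
    root = cmath.sqrt(beta * beta - 4 * alpha * gamma)
    return (-beta + root) / (2 * alpha), (-beta - root) / (2 * alpha)

# Discriminant 4 - 20 = -16 < 0, so the roots form a conjugate pair.
x1, x2 = quadratic_roots(1.0, 2.0, 5.0)
print(x1, x2)

# Discriminant 9 - 8 = 1 >= 0, so both roots are real (zero imaginary part).
r1, r2 = quadratic_roots(1.0, -3.0, 2.0)
print(r1, r2)
```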


A.3. Calculus 


Continuity 


A function f(x) is said to be continuous at x = c if f(c) is finite and if for
every $\varepsilon > 0$ there is a $\delta > 0$ such that $|f(x) - f(c)| < \varepsilon$ whenever $|x - c| < \delta$.


Derivatives of Some Simple Functions 
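The derivative of $f(\cdot)$ with respect to x is defined by
$$\frac{df}{dx} = \lim_{\Delta \to 0} \frac{f(x + \Delta) - f(x)}{\Delta},$$
provided that this limit exists.
If $f(\cdot)$ is linear in x, or
$$f(x) = \alpha + \beta x,$$
then the derivative is just the coefficient on x:
$$\frac{df}{dx} = \lim_{\Delta \to 0} \frac{[\alpha + \beta(x + \Delta)] - [\alpha + \beta x]}{\Delta} = \lim_{\Delta \to 0} \frac{\beta\Delta}{\Delta} = \beta.$$
For a quadratic function
$$f(x) = x^2,$$
the derivative is
$$\frac{df}{dx} = \lim_{\Delta \to 0} \frac{[x + \Delta]^2 - x^2}{\Delta} = \lim_{\Delta \to 0} \frac{[x^2 + 2x\Delta + \Delta^2] - x^2}{\Delta} = \lim_{\Delta \to 0} \{2x + \Delta\} = 2x,$$
and in general,
$$\frac{dx^k}{dx} = kx^{k-1}.$$ [A.3.1]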


For the trigonometric functions, it can be shown that when x is measured in radians, 


$$\frac{d\sin(x)}{dx} = \cos(x)$$ [A.3.2]
$$\frac{d\cos(x)}{dx} = -\sin(x).$$ [A.3.3]




The derivative df (x)/dx is itself a function of x. Often we want to specify the 
point at which the derivative should be evaluated, say, c. This is indicated by 
$$\left.\frac{df(x)}{dx}\right|_{x=c}.$$

For example,

$$\left.\frac{dx^2}{dx}\right|_{x=3} = 2x|_{x=3} = 6.$$


Note that this notation refers to taking the derivative first and then evaluating the 
derivative at a particular point such as x = 3. 


Chain Rule 


The chain rule states that for composite functions such as 


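$$g(x) = f(u(x)),$$

the derivative is

$$\frac{dg(x)}{dx} = \frac{df}{du}\cdot\frac{du}{dx}.$$ [A.3.4]
For example, to evaluate
$$\frac{d(\alpha + \beta x)^k}{dx},$$
we let $f(u) = u^k$ and $u(x) = \alpha + \beta x$. Then
$$\frac{df}{du}\cdot\frac{du}{dx} = ku^{k-1}\cdot\beta.$$
Thus,
$$\frac{d(\alpha + \beta x)^k}{dx} = \beta k(\alpha + \beta x)^{k-1}.$$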


Higher-Order Derivatives 


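The second derivative is defined by

$$\frac{d^2 f(x)}{dx^2} = \frac{d}{dx}\left[\frac{df(x)}{dx}\right].$$

For example,
$$\frac{d^2 x^k}{dx^2} = \frac{d[kx^{k-1}]}{dx} = k(k - 1)x^{k-2}$$
and
$$\frac{d^2 \sin(x)}{dx^2} = \frac{d\cos(x)}{dx} = -\sin(x).$$ [A.3.5]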


In general, the jth-order derivative is the derivative of the (j — 1)th-order 
derivative. 




Geometric Series 


Consider the sum 


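$$s_T = 1 + \phi + \phi^2 + \phi^3 + \cdots + \phi^T.$$ [A.3.6]
Multiplying both sides of [A.3.6] by $\phi$,
$$\phi s_T = \phi + \phi^2 + \phi^3 + \cdots + \phi^T + \phi^{T+1}.$$ [A.3.7]
Subtracting [A.3.7] from [A.3.6] produces
$$(1 - \phi)s_T = 1 - \phi^{T+1}.$$ [A.3.8]

For any $\phi \neq 1$, both sides of [A.3.8] can be divided by $(1 - \phi)$. Hence, the sum
in [A.3.6] is equal to

$$s_T = \begin{cases} \dfrac{1 - \phi^{T+1}}{1 - \phi} & \phi \neq 1 \\ T + 1 & \phi = 1. \end{cases}$$ [A.3.9]
From [A.3.9],
$$\lim_{T \to \infty} s_T = \frac{1}{1 - \phi}, \qquad |\phi| < 1,$$
and so
$$(1 + \phi + \phi^2 + \phi^3 + \cdots) = \frac{1}{1 - \phi}, \qquad |\phi| < 1.$$ [A.3.10]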
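
The closed form [A.3.9] and the limit [A.3.10] can be checked against a direct term-by-term sum. This is a minimal sketch; the function names are illustrative assumptions.

```python
def geometric_sum(phi: float, T: int) -> float:
    """Closed form for s_T = 1 + phi + ... + phi^T from [A.3.9]."""
    if phi == 1:
        return float(T + 1)
    return (1 - phi ** (T + 1)) / (1 - phi)

def brute_force_sum(phi: float, T: int) -> float:
    """Direct evaluation of the sum in [A.3.6]."""
    return sum(phi ** t for t in range(T + 1))

print(geometric_sum(0.5, 10), brute_force_sum(0.5, 10))
# For |phi| < 1 the sum approaches 1 / (1 - phi) as T grows ([A.3.10]).
print(geometric_sum(0.9, 500), 1 / (1 - 0.9))
```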


Taylor Series Approximations 


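Suppose that the first through the (r + 1)th derivatives of a function f(x)
exist and are continuous in a neighborhood of c. Taylor's theorem states that the
value of f(x) at $x = c + \Delta$ is given by

$$f(c + \Delta) = f(c) + \left.\frac{df}{dx}\right|_{x=c}\cdot\Delta + \frac{1}{2!}\left.\frac{d^2 f}{dx^2}\right|_{x=c}\cdot\Delta^2 + \frac{1}{3!}\left.\frac{d^3 f}{dx^3}\right|_{x=c}\cdot\Delta^3 + \cdots + \frac{1}{r!}\left.\frac{d^r f}{dx^r}\right|_{x=c}\cdot\Delta^r + R_r(c, x),$$ [A.3.11]

where r! denotes r factorial:
$$r! = r(r - 1)\cdot(r - 2)\cdots 2\cdot 1.$$
The remainder $R_r(c, x)$ is given by

$$R_r(c, x) = \frac{1}{(r + 1)!}\left.\frac{d^{r+1} f}{dx^{r+1}}\right|_{x=\delta}\cdot\Delta^{r+1},$$

where $\delta$ is a number between c and x. Notice that the remainder vanishes for small
$\Delta$:

$$\lim_{\Delta \to 0} \frac{R_r(c, x)}{\Delta^r} = 0.$$

Setting $R_r(c, x) = 0$ and $x = c + \Delta$ in [A.3.11] produces an rth-order Taylor series
approximation to the function f(x) in the neighborhood of x = c:

$$f(x) \cong f(c) + \left.\frac{df}{dx}\right|_{x=c}\cdot(x - c) + \frac{1}{2!}\left.\frac{d^2 f}{dx^2}\right|_{x=c}\cdot(x - c)^2 + \cdots + \frac{1}{r!}\left.\frac{d^r f}{dx^r}\right|_{x=c}\cdot(x - c)^r.$$ [A.3.12]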


Power Series 


If the remainder $R_r(c, x)$ in [A.3.11] converges to zero for all x as $r \to \infty$, a
power series can be used to characterize the function f(x). To find a power series,
we choose a particular value c around which to center the expansion, such as c =
0. We then use [A.3.12] with $r \to \infty$. For example, consider the sine function. The
first two derivatives are given by [A.3.2] and [A.3.5], with the following higher-order
derivatives:

$$\frac{d^3 \sin(x)}{dx^3} = -\cos(x)$$
$$\frac{d^4 \sin(x)}{dx^4} = \sin(x)$$
$$\frac{d^5 \sin(x)}{dx^5} = \cos(x),$$

and so on. Evaluated at x = 0, we have
$$f(0) = \sin(0) = 0$$
$$\left.\frac{df}{dx}\right|_{x=0} = \cos(0) = 1$$
$$\left.\frac{d^2 f}{dx^2}\right|_{x=0} = -\sin(0) = 0$$
$$\left.\frac{d^3 f}{dx^3}\right|_{x=0} = -\cos(0) = -1$$
$$\left.\frac{d^4 f}{dx^4}\right|_{x=0} = \sin(0) = 0$$
$$\left.\frac{d^5 f}{dx^5}\right|_{x=0} = \cos(0) = 1.$$

Substituting into [A.3.12] with c = 0 and letting $r \to \infty$ produces a power series
for the sine function:

$$\sin(x) = x - \frac{1}{3!}x^3 + \frac{1}{5!}x^5 - \frac{1}{7!}x^7 + \cdots.$$ [A.3.13]
Similar calculations give a power series for the cosine function:
$$\cos(x) = 1 - \frac{1}{2!}x^2 + \frac{1}{4!}x^4 - \frac{1}{6!}x^6 + \cdots.$$ [A.3.14]
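
To see how quickly the power series [A.3.13] converges, one can compare partial sums with a library value of the sine. The sketch below is illustrative; `sin_series` and its `terms` parameter are assumed names, not notation from the text.

```python
import math

def sin_series(x: float, terms: int) -> float:
    """Partial sum of [A.3.13]: x - x^3/3! + x^5/5! - ... with `terms` terms."""
    total = 0.0
    for n in range(terms):
        total += (-1) ** n * x ** (2 * n + 1) / math.factorial(2 * n + 1)
    return total

# Each extra term shrinks the gap to math.sin for a fixed x.
for terms in (1, 2, 3, 5, 10):
    approx = sin_series(1.0, terms)
    print(terms, approx, abs(approx - math.sin(1.0)))
```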


Exponential Functions 
A number $\gamma$ raised to the power x,
$$f(x) = \gamma^x,$$
is called an exponential function of x. The number $\gamma$ is called the base of this
function, and x is called the exponent. To multiply two exponential functions that
share the same base, the exponents are added:

$$(\gamma^x)\cdot(\gamma^y) = \gamma^{x+y}.$$ [A.3.15]
For example,
$$(\gamma^2)\cdot(\gamma^3) = (\gamma\cdot\gamma)\cdot(\gamma\cdot\gamma\cdot\gamma) = \gamma^5.$$
To raise an exponential function to the power k, the exponents are multiplied:
$$[\gamma^x]^k = \gamma^{xk}.$$ [A.3.16]
For example,
$$[\gamma^2]^3 = [\gamma^2]\cdot[\gamma^2]\cdot[\gamma^2] = \gamma^6.$$

Exponentiation is distributive over multiplication:

$$(\alpha\cdot\beta)^x = (\alpha^x)\cdot(\beta^x).$$ [A.3.17]
Negative exponents denote reciprocals:
$$\gamma^{-k} = (1/\gamma)^k.$$
Any number raised to the power 0 is taken to be equal to unity:
$$\gamma^0 = 1.$$ [A.3.18]
This convention is sensible, since if y = -x in [A.3.15],

$$(\gamma^x)\cdot(\gamma^{-x}) = \gamma^0$$
and

$$(\gamma^x)\cdot(\gamma^{-x}) = \frac{\gamma^x}{\gamma^x} = 1.$$


The Number e 


The base for the natural logarithms is denoted e. The number e has the 
property that an exponential function with base e equals its own derivative: 
$$\frac{de^x}{dx} = e^x.$$ [A.3.19]

Clearly, all the higher-order derivatives of $e^x$ are equal to $e^x$ as well:
$$\frac{d^r e^x}{dx^r} = e^x.$$ [A.3.20]

We sometimes use the expression "exp[x]" to represent "e raised to the power
x":
$$\exp[x] \equiv e^x.$$

If u(x) denotes a separate function of x, the derivative of the compound
function $e^{u(x)}$ can be evaluated using the chain rule:
$$\frac{de^{u(x)}}{dx} = \frac{de^u}{du}\cdot\frac{du}{dx} = e^{u(x)}\frac{du}{dx}.$$ [A.3.21]




To find a power series for the function $f(x) = e^x$, notice from [A.3.20] that

$$\frac{d^r f}{dx^r} = e^x,$$
and so, from [A.3.18],
$$\left.\frac{d^r f}{dx^r}\right|_{x=0} = e^0 = 1$$ [A.3.22]

for all r. Substituting [A.3.22] into [A.3.12] with c = 0 yields a power series for
the function $f(x) = e^x$:

$$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!} + \cdots.$$ [A.3.23]

Setting x = 1 in [A.3.23] gives a numerical procedure for calculating the
value of e:

$$e = 1 + 1 + \frac{1}{2!} + \frac{1}{3!} + \frac{1}{4!} + \cdots = 2.71828\ldots.$$


Euler Relations and De Moivre’s Theorem 


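Suppose we evaluate the power series [A.3.23] at the imaginary number $x = i\theta$,
where $i = \sqrt{-1}$ and $\theta$ is some real angle measured in radians:
$$e^{i\theta} = 1 + (i\theta) + \frac{(i\theta)^2}{2!} + \frac{(i\theta)^3}{3!} + \frac{(i\theta)^4}{4!} + \frac{(i\theta)^5}{5!} + \cdots = \left\{1 - \frac{\theta^2}{2!} + \frac{\theta^4}{4!} - \cdots\right\} + i\cdot\left\{\theta - \frac{\theta^3}{3!} + \frac{\theta^5}{5!} - \cdots\right\}.$$ [A.3.24]

Reflecting on [A.3.13] and [A.3.14] gives another interpretation of [A.3.24]:

$$e^{i\theta} = \cos(\theta) + i\cdot\sin(\theta).$$ [A.3.25]
Similarly,
$$e^{-i\theta} = 1 + (-i\theta) + \frac{(-i\theta)^2}{2!} + \frac{(-i\theta)^3}{3!} + \frac{(-i\theta)^4}{4!} + \frac{(-i\theta)^5}{5!} + \cdots = \left\{1 - \frac{\theta^2}{2!} + \frac{\theta^4}{4!} - \cdots\right\} - i\cdot\left\{\theta - \frac{\theta^3}{3!} + \frac{\theta^5}{5!} - \cdots\right\} = \cos(\theta) - i\cdot\sin(\theta).$$ [A.3.26]

To raise a complex number (a + bi) to the kth power, the complex number
is written in polar coordinate form as in [A.2.6]:

$$a + bi = R[\cos(\theta) + i\cdot\sin(\theta)].$$
Using [A.3.25], this can then be treated as an exponential function of $\theta$:

$$a + bi = R\cdot e^{i\theta}.$$ [A.3.27]
Now raise both sides of [A.3.27] to the kth power, recalling [A.3.17] and [A.3.16]:
$$(a + bi)^k = R^k\cdot[e^{i\theta}]^k = R^k\cdot e^{i\theta k}.$$ [A.3.28]

Finally, use [A.3.25] in reverse,
$$e^{i(\theta k)} = \cos(\theta k) + i\cdot\sin(\theta k),$$

to deduce that [A.3.28] can be written

$$(a + bi)^k = R^k\cdot[\cos(\theta k) + i\cdot\sin(\theta k)].$$ [A.3.29]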
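
Result [A.3.29] can be compared with direct complex exponentiation. In this sketch, `cmath.polar` supplies R and the angle; the function name `power_via_polar` is an illustrative assumption.

```python
import cmath
import math

def power_via_polar(z: complex, k: int) -> complex:
    """Compute z**k as R^k [cos(theta k) + i sin(theta k)], following [A.3.29]."""
    R, theta = cmath.polar(z)  # modulus and angle of z
    return (R ** k) * complex(math.cos(theta * k), math.sin(theta * k))

z = 3 + 4j
print(power_via_polar(z, 3), z ** 3)  # the two should agree up to rounding
```

The two routes differ only by floating-point rounding, which is why a comparison should use a small tolerance rather than exact equality.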


Definition of Natural Logarithm 


The natural logarithm (denoted throughout the text simply by “log”) is the 
inverse of the function e*: 


$$\log(e^x) = x.$$
Notice from [A.3.18] that $e^0 = 1$ and therefore log(1) = 0.


Properties of Logarithms 
For any x > 0, it is also the case that 
$$x = e^{\log(x)}.$$ [A.3.30]

From [A.3.30] and [A.3.15], we see that the log of the product of two numbers is
equal to the sum of the logs:

$$\log(a\cdot b) = \log[(e^{\log(a)})\cdot(e^{\log(b)})] = \log[e^{(\log(a) + \log(b))}] = \log(a) + \log(b).$$
Also, use [A.3.16] to write

$$x^a = [e^{\log(x)}]^a = e^{a\cdot\log(x)}.$$ [A.3.31]

Taking logs of both sides of [A.3.31] reveals that the log of a number raised to the
power a is equal to a times the log of the number:

$$\log(x^a) = a\cdot\log(x).$$


Derivatives of Natural Logarithms 


Let u(x) = log(x), and write the right side of [A.3.30] as $e^{u(x)}$. Differentiating
both sides of [A.3.30] using [A.3.21] reveals that

$$\frac{dx}{dx} = e^{\log(x)}\cdot\frac{d\log(x)}{dx}$$

or
$$1 = x\cdot\frac{d\log(x)}{dx}.$$
Thus,
$$\frac{d\log(x)}{dx} = \frac{1}{x}.$$ [A.3.32]


Logarithms and Elasticities 


It is sometimes also useful to differentiate a function f(x) with respect to the
variable log(x). To do so, write f(x) as f(u(x)), where

$$u(x) = \exp[\log(x)].$$

Now use the chain rule to differentiate:
$$\frac{df(x)}{d\log(x)} = \frac{df}{du}\cdot\frac{du}{d\log(x)}.$$ [A.3.33]

But from [A.3.21],

$$\frac{du}{d\log(x)} = \exp[\log(x)]\cdot\frac{d\log(x)}{d\log(x)} = x.$$ [A.3.34]

Substituting [A.3.34] into [A.3.33] gives

$$\frac{df}{d\log(x)} = x\,\frac{df}{dx}.$$

It follows from [A.3.32] that

$$\frac{d\log f(x)}{d\log x} = \frac{1}{f}\,x\,\frac{df}{dx} \cong \frac{[f(x + \Delta) - f(x)]/f(x)}{[(x + \Delta) - x]/x},$$


which has the interpretation as the elasticity of f with respect to x, or the percent 
change in f resulting from a 1% increase in x. 


Logarithms and Percent 


An approximation to the natural log function is obtained from a first-order 
Taylor series around c = 1: 


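$$\log(1 + \Delta) \cong \log(1) + \left.\frac{d\log(x)}{dx}\right|_{x=1}\cdot\Delta.$$ [A.3.35]

But log(1) = 0, and

$$\left.\frac{d\log(x)}{dx}\right|_{x=1} = 1.$$
Thus, for $\Delta$ close to zero, an excellent approximation is provided by
$$\log(1 + \Delta) \cong \Delta.$$ [A.3.36]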


An implication of [A.3.36] is the following. Let r denote the net interest rate
measured as a fraction of 1; for example, r = 0.05 corresponds to a 5% interest
rate. Then (1 + r) denotes the gross interest rate (principal plus net interest).
Equation [A.3.36] says that the log of the gross interest rate (1 + r) is essentially 
the same number as the net interest rate (r). 
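
The quality of the approximation [A.3.36] can be inspected for a few interest rates. The sketch below is illustrative only; `approx_error` is an assumed helper name, and `math.log1p(r)` is the standard-library function that computes log(1 + r).

```python
import math

def approx_error(r: float) -> float:
    """Absolute gap between log(1 + r) and its first-order approximation r."""
    return abs(math.log1p(r) - r)

# The gap is small for rates near zero and widens as r grows.
for r in (0.01, 0.05, 0.10, 0.25):
    print(r, math.log1p(r), approx_error(r))
```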


Definition of Indefinite Integral 


Integration (indicated by $\int dx$) is the inverse operation from differentiation.
For example,

$$\int x\,dx = x^2/2,$$ [A.3.37]

because

$$\frac{d(x^2/2)}{dx} = x.$$ [A.3.38]
The function $x^2/2$ is not the only function satisfying [A.3.38]; the function
$$(x^2/2) + C$$

also works for any constant C. The term C is referred to as the constant of integration.


Some Useful Indefinite Integrals 


The following integrals can be confirmed from [A.3.1], [A.3.32], [A.3.2], 
[A.3.3], and [A.3.21]: 


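$$\int x^k\,dx = \frac{x^{k+1}}{k + 1} + C, \qquad k \neq -1$$ [A.3.39]

$$\int x^{-1}\,dx = \begin{cases} \log(x) + C & x > 0 \\ \log(-x) + C & x < 0 \end{cases}$$ [A.3.40]
$$\int \cos(x)\,dx = \sin(x) + C$$ [A.3.41]
$$\int \sin(x)\,dx = -\cos(x) + C$$ [A.3.42]
$$\int e^{ax}\,dx = (1/a)\cdot e^{ax} + C.$$ [A.3.43]

It is also straightforward to demonstrate that for constants a and b not depending
on x,

$$\int [a\cdot f(x) + b\cdot g(x)]\,dx = a\int f(x)\,dx + b\int g(x)\,dx + C.$$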


Definite Integrals 


Consider the continuous function f(x) plotted in Figure A.5. Define the
function A(x; a) to be the area under f(x) between a and x, viewed as a function
of x. Thus, A(b; a) would be the area between a and b. Suppose we increase b by
a small amount $\Delta$. This is approximately the same as adding a rectangle of height
f(b) and width $\Delta$ to the area A(b; a):

$$A(b + \Delta; a) \cong A(b; a) + f(b)\cdot\Delta,$$

or
$$\frac{A(b + \Delta; a) - A(b; a)}{\Delta} \cong f(b).$$
In the limit as $\Delta \to 0$,
$$\left.\frac{dA(x; a)}{dx}\right|_{x=b} = f(b).$$ [A.3.44]

Now, [A.3.44] has to hold for any value of b > a that we might have chosen,
implying that the area function A(x; a) is the inverse of differentiation:

$$A(x; a) = F(x) + C,$$ [A.3.45]
where
$$\frac{dF(x)}{dx} = f(x).$$

FIGURE A.5  The definite integral as the area under a function.

To find the value of C, notice that A(a; a) in [A.3.45] should be equal to zero:
$$A(a; a) = 0 = F(a) + C.$$
For this to be true,
$$C = -F(a).$$ [A.3.46]
Evaluating [A.3.45] at x = b, the area between a and b is given by
$$A(b; a) = F(b) + C,$$

or using [A.3.46],

$$A(b; a) = F(b) - F(a),$$ [A.3.47]
where F(x) satisfies dF/dx = f(x):

$$F(x) = \int f(x)\,dx.$$

Equation [A.3.47] is known as the fundamental theorem of calculus.
The operation in [A.3.47] is known as calculating a definite integral:

$$\int_a^b f(x)\,dx = \left[\int f(x)\,dx\right]_{x=b} - \left[\int f(x)\,dx\right]_{x=a}.$$


For example, to find the area under the sine function between $\theta = 0$ and
$\theta = \pi/2$, we use [A.3.42]:

$$\int_0^{\pi/2} \sin(\theta)\,d\theta = [-\cos(\theta)]_{\theta=\pi/2} - [-\cos(\theta)]_{\theta=0} = [-\cos(\pi/2)] + [\cos(0)] = 0 + 1 = 1.$$

To find the area between 0 and $2\pi$, we take

$$\int_0^{2\pi} \sin(x)\,dx = [-\cos(2\pi)] + \cos(0) = -1 + 1 = 0.$$

The positive values for sin(x) between 0 and $\pi$ exactly cancel out the negative
values between $\pi$ and $2\pi$.
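
Both definite integrals above can be confirmed numerically. The sketch below swaps in a simple trapezoid rule as the area approximation; this is an illustrative numerical technique, not a method from the text, and the name `trapezoid` and the default of 10,000 subintervals are assumptions.

```python
import math

def trapezoid(f, a: float, b: float, n: int = 10_000) -> float:
    """Approximate the area under f between a and b with n trapezoids."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h

print(trapezoid(math.sin, 0.0, math.pi / 2))  # close to F(b) - F(a) = 1
print(trapezoid(math.sin, 0.0, 2 * math.pi))  # close to 0: areas cancel
```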


A.4. Matrix Algebra 


Definitions 


An (m × n) matrix is an array of numbers ordered into m rows and n columns:

$$\underset{(m \times n)}{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \cdots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}.$$

If there is only one column (n = 1), then A is described as a column vector, whereas
with only one row (m = 1), A is called a row vector. A single number (m = 1 and
n = 1) is called a scalar.

If the number of rows equals the number of columns (m = n), the matrix is
said to be square. The diagonal running through $(a_{11}, a_{22}, \ldots, a_{nn})$ in a square
matrix is called the principal diagonal. If all elements off the principal diagonal
are zero, the matrix is said to be diagonal.

A matrix is sometimes specified by describing the element in row i, column
j:
$$A = [a_{ij}].$$
Summation and Multiplication 
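Two (m × n) matrices are added element by element:
$$\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \cdots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} + \begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1n} \\ b_{21} & b_{22} & \cdots & b_{2n} \\ \vdots & \vdots & \cdots & \vdots \\ b_{m1} & b_{m2} & \cdots & b_{mn} \end{bmatrix} = \begin{bmatrix} a_{11} + b_{11} & a_{12} + b_{12} & \cdots & a_{1n} + b_{1n} \\ a_{21} + b_{21} & a_{22} + b_{22} & \cdots & a_{2n} + b_{2n} \\ \vdots & \vdots & \cdots & \vdots \\ a_{m1} + b_{m1} & a_{m2} + b_{m2} & \cdots & a_{mn} + b_{mn} \end{bmatrix},$$

or, more compactly,
$$\underset{(m \times n)}{A} + \underset{(m \times n)}{B} = [a_{ij} + b_{ij}].$$
The product of an (m × n) matrix and an (n × q) matrix is an (m × q) matrix:

$$\underset{(m \times n)}{A} \times \underset{(n \times q)}{B} = \underset{(m \times q)}{C},$$

where the row i, column j element of C is given by $\sum_{k=1}^{n} a_{ik} b_{kj}$. Notice that
multiplication requires that the number of columns of A be the same as the number
of rows of B.
To multiply A by a scalar $\alpha$, each element of A is multiplied by $\alpha$:
$$\underset{(1 \times 1)}{\alpha} \times \underset{(m \times n)}{A} = \underset{(m \times n)}{C},$$

with
$$C = [\alpha a_{ij}].$$
It is easy to show that addition is commutative:
$$A + B = B + A;$$

whereas multiplication is not:
$$AB \neq BA.$$

Indeed, the product BA will not exist unless m = q, and even where it exists, AB
would be equal to BA only in rather special cases.
Both addition and multiplication are associative:
$$(A + B) + C = A + (B + C)$$

$$(AB)C = A(BC).$$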
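
The multiplication rule and the failure of commutativity can be seen on a small example. This is a minimal sketch that stores a matrix as a list of rows; the helper `matmul` and the sample matrices are illustrative assumptions.

```python
def matmul(A, B):
    """Product of an (m x n) and an (n x q) matrix; element (i, j) is sum_k a_ik b_kj."""
    n = len(B)
    if any(len(row) != n for row in A):
        raise ValueError("columns of A must equal rows of B")
    q = len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(q)]
            for i in range(len(A))]

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
C = [[2, 0], [1, 3]]

print(matmul(A, B))  # postmultiplying by B swaps the columns of A
print(matmul(B, A))  # premultiplying by B swaps the rows of A
# Associativity holds even though commutativity does not.
print(matmul(matmul(A, B), C) == matmul(A, matmul(B, C)))
```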


Identity Matrix 


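The identity matrix of order n (denoted $I_n$) is an (n × n) matrix with 1s along
the principal diagonal and 0s elsewhere:

$$I_n = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \cdots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}.$$
For any (m × n) matrix A,
$$A \times I_n = A$$
and also
$$I_m \times A = A.$$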


Powers of Matrices 

For an (n × n) matrix A, the expression $A^2$ denotes $A\cdot A$. The expression
$A^k$ indicates the matrix A multiplied by itself k times, with $A^0$ interpreted as the
(n × n) identity matrix.




Transposition 
Let $a_{ij}$ denote the row i, column j element of a matrix A:

$$A = [a_{ij}].$$
The transpose of A (denoted A') is given by
$$A' = [a_{ji}].$$
For example, the transpose of
$$\begin{bmatrix} 2 & 4 & 6 \\ 3 & 5 & 7 \\ 1 & 2 & 3 \end{bmatrix}$$
is
$$\begin{bmatrix} 2 & 3 & 1 \\ 4 & 5 & 2 \\ 6 & 7 & 3 \end{bmatrix}.$$
The transpose of a row vector is a column vector.
It is easy to verify the following:

$$(A')' = A$$ [A.4.1]
$$(A + B)' = A' + B'$$ [A.4.2]
$$(AB)' = B'A'.$$ [A.4.3]


Symmetric Matrices 
A square matrix satisfying A = A' is said to be symmetric.


Trace of a Matrix 

The trace of an (n × n) matrix is defined as the sum of the elements along
the principal diagonal:
$$\text{trace}(A) = a_{11} + a_{22} + \cdots + a_{nn}.$$

If A is an (m × n) matrix and B is an (n × m) matrix, then AB is an (m ×
m) matrix whose trace is

$$\text{trace}(AB) = \sum_{j=1}^{n} a_{1j}b_{j1} + \sum_{j=1}^{n} a_{2j}b_{j2} + \cdots + \sum_{j=1}^{n} a_{mj}b_{jm} = \sum_{k=1}^{m}\sum_{j=1}^{n} a_{kj}b_{jk}.$$
The product BA is an (n × n) matrix whose trace is
$$\text{trace}(BA) = \sum_{k=1}^{m} b_{1k}a_{k1} + \sum_{k=1}^{m} b_{2k}a_{k2} + \cdots + \sum_{k=1}^{m} b_{nk}a_{kn} = \sum_{j=1}^{n}\sum_{k=1}^{m} b_{jk}a_{kj}.$$

Thus,
$$\text{trace}(AB) = \text{trace}(BA).$$
If A and B are both (n × n) matrices, then
$$\text{trace}(A + B) = \text{trace}(A) + \text{trace}(B).$$

If A is an (n × n) matrix and $\lambda$ is a scalar, then

$$\text{trace}(\lambda A) = \sum_{i=1}^{n} \lambda a_{ii} = \lambda\cdot\sum_{i=1}^{n} a_{ii} = \lambda\cdot\text{trace}(A).$$
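
The identity trace(AB) = trace(BA) holds even when AB and BA have different dimensions. The following sketch illustrates this with a (2 × 3) and a (3 × 2) matrix; the helpers `matmul` and `trace` and the sample matrices are illustrative assumptions.

```python
def matmul(A, B):
    """Product of an (m x n) and an (n x q) matrix stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def trace(M):
    """Sum of the elements on the principal diagonal of a square matrix."""
    if any(len(row) != len(M) for row in M):
        raise ValueError("trace is defined only for square matrices")
    return sum(M[i][i] for i in range(len(M)))

A = [[1, 2, 3], [4, 5, 6]]    # (2 x 3)
B = [[1, 0], [2, 1], [0, 3]]  # (3 x 2)

AB = matmul(A, B)  # (2 x 2)
BA = matmul(B, A)  # (3 x 3)
print(trace(AB), trace(BA))
```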


Partitioned Matrices 


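A partitioned matrix is a matrix whose individual elements are themselves
matrices. For example, the (3 × 4) matrix

$$A = \begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \end{bmatrix}$$

could be written as

$$A = \begin{bmatrix} A_1 & A_2 \\ a_1' & a_2' \end{bmatrix},$$

where

$$A_1 = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \qquad A_2 = \begin{bmatrix} a_{13} & a_{14} \\ a_{23} & a_{24} \end{bmatrix}$$
$$a_1' = [a_{31} \;\; a_{32}] \qquad a_2' = [a_{33} \;\; a_{34}].$$
Partitioned matrices are added or multiplied as if the individual elements were
scalars, provided that the row and column dimensions permit the appropriate matrix
operations. For example,

$$\begin{bmatrix} A_1 & A_2 \\ A_3 & A_4 \end{bmatrix} + \begin{bmatrix} B_1 & B_2 \\ B_3 & B_4 \end{bmatrix} = \begin{bmatrix} A_1 + B_1 & A_2 + B_2 \\ A_3 + B_3 & A_4 + B_4 \end{bmatrix},$$
where $A_1$ and $B_1$ are $(m_1 \times n_1)$, $A_2$ and $B_2$ are $(m_1 \times n_2)$, $A_3$ and $B_3$ are $(m_2 \times n_1)$, and $A_4$ and $B_4$ are $(m_2 \times n_2)$.
Similarly,
$$\begin{bmatrix} A_1 & A_2 \\ A_3 & A_4 \end{bmatrix} \begin{bmatrix} B_1 & B_2 \\ B_3 & B_4 \end{bmatrix} = \begin{bmatrix} A_1B_1 + A_2B_3 & A_1B_2 + A_2B_4 \\ A_3B_1 + A_4B_3 & A_3B_2 + A_4B_4 \end{bmatrix},$$
where $A_1$ is $(m_1 \times n_1)$, $A_2$ is $(m_1 \times n_2)$, $A_3$ is $(m_2 \times n_1)$, $A_4$ is $(m_2 \times n_2)$, $B_1$ is $(n_1 \times q_1)$, $B_2$ is $(n_1 \times q_2)$, $B_3$ is $(n_2 \times q_1)$, and $B_4$ is $(n_2 \times q_2)$, so that the four blocks of the product are $(m_1 \times q_1)$, $(m_1 \times q_2)$, $(m_2 \times q_1)$, and $(m_2 \times q_2)$, respectively.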


Definition of Determinant 
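The determinant of a 2 × 2 matrix is given by the following scalar:
$$|A| = a_{11}a_{22} - a_{12}a_{21}.$$ [A.4.4]

The determinant of an n × n matrix can be defined recursively. Let $A_{ij}$ denote
the (n - 1) × (n - 1) matrix formed by deleting row i and column j from A. The
determinant of A is given by

$$|A| = \sum_{j=1}^{n} (-1)^{j+1} a_{1j}|A_{1j}|.$$ [A.4.5]

For example, the determinant of a 3 × 3 matrix is
$$\begin{vmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{vmatrix} = a_{11}\begin{vmatrix} a_{22} & a_{23} \\ a_{32} & a_{33} \end{vmatrix} - a_{12}\begin{vmatrix} a_{21} & a_{23} \\ a_{31} & a_{33} \end{vmatrix} + a_{13}\begin{vmatrix} a_{21} & a_{22} \\ a_{31} & a_{32} \end{vmatrix}.$$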
432 433 




Properties of Determinants 


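A square matrix is said to be lower triangular if all the elements above the
principal diagonal are zero ($a_{ij} = 0$ for j > i):

$$A = \begin{bmatrix} a_{11} & 0 & 0 & \cdots & 0 \\ a_{21} & a_{22} & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \cdots & \vdots \\ a_{n1} & a_{n2} & a_{n3} & \cdots & a_{nn} \end{bmatrix}.$$

The determinant of a lower triangular matrix is simply the product of the terms
along the principal diagonal:

$$|A| = a_{11}a_{22}\cdots a_{nn}.$$ [A.4.6]

That [A.4.6] holds for n = 2 follows immediately from [A.4.4]. Given that it holds
for a matrix of order n - 1, equation [A.4.5] implies that it holds for n:

$$|A| = a_{11}\begin{vmatrix} a_{22} & 0 & \cdots & 0 \\ a_{32} & a_{33} & \cdots & 0 \\ \vdots & \vdots & \cdots & \vdots \\ a_{n2} & a_{n3} & \cdots & a_{nn} \end{vmatrix} - 0\cdot|A_{12}| + \cdots + (-1)^{n+1}\, 0\cdot|A_{1n}|.$$

An immediate implication of [A.4.6] is that the determinant of the identity
matrix is unity:

$$|I_n| = 1.$$ [A.4.7]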


Another useful fact about determinants is that if an n × n matrix A is multiplied
by a scalar $\alpha$, the effect is to multiply the determinant by $\alpha^n$:

$$|\alpha A| = \alpha^n|A|.$$ [A.4.8]
Again, [A.4.8] is immediately apparent for the n = 2 case from [A.4.4]:
$$|\alpha A| = \begin{vmatrix} \alpha a_{11} & \alpha a_{12} \\ \alpha a_{21} & \alpha a_{22} \end{vmatrix} = (\alpha a_{11}\alpha a_{22}) - (\alpha a_{12}\alpha a_{21}) = \alpha^2(a_{11}a_{22} - a_{12}a_{21}) = \alpha^2|A|.$$

Given that it holds for n - 1, it is simple to verify for n using [A.4.5].

By contrast, if a single row of A is multiplied by the constant $\alpha$ (as opposed
to multiplying the entire matrix by $\alpha$), then the determinant is multiplied by $\alpha$. If
the row that is multiplied by $\alpha$ is the first row, then this result is immediately
apparent from [A.4.5]. If only the ith row of A is multiplied by $\alpha$, the result can
be shown by recursively applying [A.4.5] until the elements of the ith row appear
explicitly in the formula.

Suppose that some constant c times the second row of a 2 × 2 matrix is added
to the first row. This operation has no effect on the determinant:

$$\begin{vmatrix} a_{11} + ca_{21} & a_{12} + ca_{22} \\ a_{21} & a_{22} \end{vmatrix} = (a_{11} + ca_{21})a_{22} - (a_{12} + ca_{22})a_{21} = a_{11}a_{22} - a_{12}a_{21}.$$


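Similarly, if some constant c times the third row of a 3 × 3 matrix is added to the
second row, the determinant will again be unchanged:

$$\begin{vmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} + ca_{31} & a_{22} + ca_{32} & a_{23} + ca_{33} \\ a_{31} & a_{32} & a_{33} \end{vmatrix}$$
$$= a_{11}\begin{vmatrix} a_{22} + ca_{32} & a_{23} + ca_{33} \\ a_{32} & a_{33} \end{vmatrix} - a_{12}\begin{vmatrix} a_{21} + ca_{31} & a_{23} + ca_{33} \\ a_{31} & a_{33} \end{vmatrix} + a_{13}\begin{vmatrix} a_{21} + ca_{31} & a_{22} + ca_{32} \\ a_{31} & a_{32} \end{vmatrix}$$
$$= a_{11}\begin{vmatrix} a_{22} & a_{23} \\ a_{32} & a_{33} \end{vmatrix} - a_{12}\begin{vmatrix} a_{21} & a_{23} \\ a_{31} & a_{33} \end{vmatrix} + a_{13}\begin{vmatrix} a_{21} & a_{22} \\ a_{31} & a_{32} \end{vmatrix}.$$

In general, if any row of an n × n matrix is multiplied by c and added to another
row, the new matrix will have the same determinant as the original. Similarly,
multiplying any column by c and adding the result to another column will not
change the determinant.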

This can be viewed as a special case of the following result. If A and B are 
both n X n matrices, then 


|AB| = |A|-|B]. [A.4.9] 


Adding c times the second column of a 2 × 2 matrix A to the first column can be
thought of as postmultiplying A by the following matrix:

$$B = \begin{bmatrix} 1 & 0 \\ c & 1 \end{bmatrix}.$$

Since B is lower triangular with 1s along the principal diagonal, its determinant is
unity, and so, from [A.4.9],

$$|AB| = |A|.$$


Thus, the fact that adding a multiple of one column to another does not alter the 
determinant can be viewed as an implication of [A.4.9]. 
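
The recursive definition [A.4.5] translates directly into code, which makes it easy to spot-check [A.4.8] and [A.4.9] on a small integer example. This is a sketch for illustration only: cofactor expansion is slow for large n, and the helper names and sample matrices are assumptions.

```python
def det(A):
    """Determinant by expansion along the first row, as in [A.4.5]."""
    n = len(A)
    if n == 1:
        return A[0][0]
    total = 0
    for j in range(n):
        # Minor A_1j: delete the first row and column j (0-based index here).
        minor = [row[:j] + row[j + 1:] for row in A[1:]]
        total += (-1) ** j * A[0][j] * det(minor)
    return total

def matmul(A, B):
    """Product of two conformable matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

A = [[2, 1, 0], [1, 3, 1], [0, 1, 4]]
B = [[1, 0, 0], [2, 1, 0], [0, 3, 1]]  # lower triangular with 1s on the diagonal

print(det(A), det(B), det(matmul(A, B)))     # |AB| = |A| |B|
print(det([[2 * a for a in row] for row in A]))  # 2^3 times |A|
print(det([A[1], A[0], A[2]]))               # switching two rows flips the sign
```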

If two rows of a matrix are switched, the determinant changes signs. To switch
the ith row with the jth, multiply the ith row by -1; this changes the sign of the
determinant. Then subtract row i from row j, add the new j back to i, and subtract
i from j once again. These last operations complete the switch and do not affect
the determinant further. For example, let A be a (4 × 4) matrix written in partitioned
form as

$$A = \begin{bmatrix} a_1' \\ a_2' \\ a_3' \\ a_4' \end{bmatrix},$$

where the (1 × 4) vector $a_i'$ represents the ith row of A. The determinant when
rows 1 and 4 are switched can be calculated from

$$\begin{vmatrix} a_1' \\ a_2' \\ a_3' \\ a_4' \end{vmatrix} = -\begin{vmatrix} -a_1' \\ a_2' \\ a_3' \\ a_4' \end{vmatrix} = -\begin{vmatrix} -a_1' \\ a_2' \\ a_3' \\ a_1' + a_4' \end{vmatrix} = -\begin{vmatrix} a_4' \\ a_2' \\ a_3' \\ a_1' + a_4' \end{vmatrix} = -\begin{vmatrix} a_4' \\ a_2' \\ a_3' \\ a_1' \end{vmatrix}.$$




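This result permits calculation of the determinant of A in reference to any
row of an (n × n) matrix A:

$$|A| = \sum_{j=1}^{n} (-1)^{i+j} a_{ij}|A_{ij}|.$$ [A.4.10]

To derive [A.4.10], define A* as the matrix obtained by moving the ith row of A to the first position:

$$A^* = \begin{bmatrix} a_i' \\ a_1' \\ a_2' \\ \vdots \\ a_{i-1}' \\ a_{i+1}' \\ \vdots \\ a_n' \end{bmatrix}.$$

Then, from [A.4.5],
$$|A^*| = \sum_{j=1}^{n} (-1)^{j+1} a_{ij}|A^*_{1j}| = \sum_{j=1}^{n} (-1)^{j+1} a_{ij}|A_{ij}|.$$

Moreover, A* is obtained from A by (i - 1) row switches, such as switching i with
i - 1, i - 1 with i - 2, . . . , and 2 with 1. Hence,

$$|A| = (-1)^{i-1}|A^*| = (-1)^{i-1}\sum_{j=1}^{n} (-1)^{j+1} a_{ij}|A_{ij}|,$$

as claimed in [A.4.10].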

An immediate implication of [A.4.10] is that if any row of a matrix contains 
all zeros, then the determinant of the matrix is zero. 

It can also be shown that the transpose of a matrix has the same determinant 
as the original matrix: 


|A’| = |Al. [A.4.11] 


This means, for example, that if the kth column of a matrix consists entirely of
zeros, then the determinant of the matrix is zero. It also implies that the determinant
of an upper triangular matrix (one for which $a_{ij} = 0$ for all j < i) is the product of
the terms on the principal diagonal.


Adjoint of a Matrix 


Let A denote an (n × n) matrix, and as before let $A_{ji}$ denote the [(n - 1) ×
(n - 1)] matrix that results from deleting row j and column i of A. The adjoint of
A is the (n × n) matrix whose row i, column j element is given by $(-1)^{i+j}|A_{ji}|$.


Inverse of a Matrix 


If the determinant of an n × n matrix A is not equal to zero, its inverse (an
n × n matrix denoted $A^{-1}$) exists and is found by dividing the adjoint by the
determinant:

$$A^{-1} = (1/|A|)\cdot[(-1)^{i+j}|A_{ji}|].$$ [A.4.12]

For example, for n = 2,
$$\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}^{-1} = (1/[a_{11}a_{22} - a_{12}a_{21}])\cdot\begin{bmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{bmatrix}.$$ [A.4.13]

A matrix whose inverse exists is said to be nonsingular. A matrix whose determinant
is zero is singular and has no inverse.
When an inverse exists,

$$A \times A^{-1} = I_n.$$ [A.4.14]
Taking determinants of both sides of [A.4.14] and using [A.4.9] and [A.4.7],
$$|A|\cdot|A^{-1}| = 1,$$
so
$$|A^{-1}| = 1/|A|.$$ [A.4.15]

Alternatively, taking the transpose of both sides of [A.4.14] and recalling
[A.4.3],

$$(A^{-1})'A' = I_n,$$

which means that $(A^{-1})'$ is the inverse of A':

$$(A^{-1})' = (A')^{-1}.$$
For $\alpha$ a nonzero scalar and A a nonsingular matrix,

$$[\alpha A]^{-1} = \alpha^{-1}A^{-1}.$$

Also, for A, B, and C all nonsingular (n × n) matrices,
$$[AB]^{-1} = B^{-1}A^{-1}$$

and

$$[ABC]^{-1} = C^{-1}B^{-1}A^{-1}.$$


Linear Dependence 


Let $x_1, x_2, \ldots, x_k$ be a set of k different (n × 1) vectors. The vectors are
said to be linearly dependent if there exists a set of k scalars $(c_1, c_2, \ldots, c_k)$, not
all of which are zero, such that

$$c_1x_1 + c_2x_2 + \cdots + c_kx_k = 0.$$
If no such set of nonzero numbers $(c_1, c_2, \ldots, c_k)$ exists, then the vectors $(x_1,
x_2, \ldots, x_k)$ are said to be linearly independent.
Suppose the vectors $(x_1, x_2, \ldots, x_k)$ are collected in an (n × k) matrix T,
written in partitioned form as
$$T = [x_1 \;\; x_2 \;\; \cdots \;\; x_k].$$

If the number of vectors (k) is equal to the dimension of each vector (n), then
there is a simple relation between the notion of linear dependence and the determinant
of the (n × n) matrix T; specifically, if $(x_1, x_2, \ldots, x_n)$ are linearly
dependent, then |T| = 0. To see this, suppose that $x_1$ is one of the vectors that
have a nonzero value of $c_i$. Then linear dependence means that

$$x_1 = -(c_2/c_1)x_2 - (c_3/c_1)x_3 - \cdots - (c_n/c_1)x_n.$$

Then the determinant of T is equal to
$$|T| = \left|\,[-(c_2/c_1)x_2 - (c_3/c_1)x_3 - \cdots - (c_n/c_1)x_n] \;\; x_2 \;\; \cdots \;\; x_n\,\right|.$$

But if we add $(c_n/c_1)$ times the nth column to the first column, $(c_{n-1}/c_1)$ times the
(n - 1)th column to the first column, . . . , and $(c_2/c_1)$ times the second column
to the first column, the result is

$$|T| = \left|\,0 \;\; x_2 \;\; \cdots \;\; x_n\,\right| = 0.$$

The converse can also be shown to be true: if |T| = 0, then $(x_1, x_2, \ldots,
x_n)$ are linearly dependent.


Eigenvalues and Eigenvectors 


Suppose that an n × n matrix A, a nonzero n × 1 vector x, and a scalar $\lambda$
are related by

$$Ax = \lambda x.$$ [A.4.16]

Then x is called an eigenvector of A and $\lambda$ the associated eigenvalue. Equation
[A.4.16] can be written

$$Ax - \lambda I_n x = 0$$

or

$$(A - \lambda I_n)x = 0.$$ [A.4.17]

Suppose that the matrix $(A - \lambda I_n)$ were nonsingular. Then $(A - \lambda I_n)^{-1}$ would
exist and we could premultiply [A.4.17] by $(A - \lambda I_n)^{-1}$ to deduce that

$$x = 0.$$

Thus, if a nonzero vector x exists that satisfies [A.4.16], then it must be associated
with a value of $\lambda$ such that $(A - \lambda I_n)$ is singular. An eigenvalue of the matrix A
is therefore a number $\lambda$ such that

$$|A - \lambda I_n| = 0.$$ [A.4.18]


Eigenvalues of Triangular Matrices 


Notice that if A is upper triangular or lower triangular, then $A - \lambda I_n$ is as
well, and its determinant is just the product of terms along the principal diagonal:

$$|A - \lambda I_n| = (a_{11} - \lambda)(a_{22} - \lambda)\cdots(a_{nn} - \lambda).$$

Thus, for a triangular matrix, the eigenvalues (the values of $\lambda$ for which this
expression equals zero) are just the values of A along the principal diagonal.


Linear Independence of Eigenvectors 


A useful result is that if the eigenvalues $(\lambda_1, \lambda_2, \ldots, \lambda_n)$ are all distinct,
then the associated eigenvectors $(x_1, x_2, \ldots, x_n)$ are linearly independent. To see
this for the case n = 2, consider any numbers $c_1$ and $c_2$ such that

$$c_1x_1 + c_2x_2 = 0.$$ [A.4.19]

Premultiplying both sides of [A.4.19] by A produces

$$c_1Ax_1 + c_2Ax_2 = c_1\lambda_1x_1 + c_2\lambda_2x_2 = 0.$$ [A.4.20]
If [A.4.19] is multiplied by $\lambda_1$ and subtracted from [A.4.20], the result is
$$c_2(\lambda_2 - \lambda_1)x_2 = 0.$$ [A.4.21]

But $x_2$ is an eigenvector of A, and so it cannot be the zero vector. Also, $\lambda_2 - \lambda_1$
cannot be zero, since $\lambda_2 \neq \lambda_1$. Equation [A.4.21] therefore implies that $c_2 = 0$.
A parallel set of calculations shows that $c_1 = 0$. Thus, the only values of $c_1$ and $c_2$
consistent with [A.4.19] are $c_1 = 0$ and $c_2 = 0$, which means that $x_1$ and $x_2$ are
linearly independent. A similar argument for n > 2 can be made by induction.


A Useful Decomposition 


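Suppose an n × n matrix A has n distinct eigenvalues $(\lambda_1, \lambda_2, \ldots, \lambda_n)$.
Collect these in a diagonal matrix $\Lambda$:
$$\Lambda = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \cdots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{bmatrix}.$$
Collect the eigenvectors $(x_1, x_2, \ldots, x_n)$ in an (n × n) matrix T:
$$T = [x_1 \;\; x_2 \;\; \cdots \;\; x_n].$$

Applying the formula for multiplying partitioned matrices,

$$AT = [Ax_1 \;\; Ax_2 \;\; \cdots \;\; Ax_n].$$
But since $(x_1, x_2, \ldots, x_n)$ are eigenvectors, equation [A.4.16] implies that
$$AT = [\lambda_1x_1 \;\; \lambda_2x_2 \;\; \cdots \;\; \lambda_nx_n].$$ [A.4.22]

A second application of the formula for multiplying partitioned matrices shows
that the right side of [A.4.22] is in turn equal to

$$[\lambda_1x_1 \;\; \lambda_2x_2 \;\; \cdots \;\; \lambda_nx_n] = [x_1 \;\; x_2 \;\; \cdots \;\; x_n]\begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \cdots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{bmatrix} = T\Lambda.$$
Thus, [A.4.22] can be written
$$AT = T\Lambda.$$ [A.4.23]
Now, since the eigenvalues $(\lambda_1, \lambda_2, \ldots, \lambda_n)$ are taken to be distinct, the
eigenvectors $(x_1, x_2, \ldots, x_n)$ are known to be linearly independent. Thus, $|T| \neq 0$
and $T^{-1}$ exists. Postmultiplying [A.4.23] by $T^{-1}$ reveals a useful decomposition of
A:
$$A = T\Lambda T^{-1}.$$ [A.4.24]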
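
The decomposition [A.4.24] can be verified by hand for a 2 × 2 matrix. The sketch below is deliberately restricted: it assumes real, distinct eigenvalues and a nonzero upper-right element, finds the eigenvalues from [A.4.18] written as a quadratic in $\lambda$, and uses the 2 × 2 inverse formula [A.4.13]. All function names and the sample matrix are illustrative assumptions.

```python
import math

def matmul(A, B):
    """Product of two conformable matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def inverse_2x2(M):
    """Inverse of a 2 x 2 matrix via the adjoint formula [A.4.13]."""
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]

def eigen_2x2(A):
    """Return (T, Lam) for a 2 x 2 matrix with real distinct eigenvalues.

    |A - lam*I| = lam^2 - trace*lam + det = 0 gives the eigenvalues, and
    x = (a12, lam - a11)' solves (A - lam*I)x = 0 when a12 is nonzero.
    """
    (a11, a12), (a21, a22) = A
    tr, det = a11 + a22, a11 * a22 - a12 * a21
    disc = tr * tr - 4 * det
    if disc <= 0 or a12 == 0:
        raise ValueError("sketch needs real distinct eigenvalues and a12 != 0")
    lam1 = (tr + math.sqrt(disc)) / 2
    lam2 = (tr - math.sqrt(disc)) / 2
    T = [[a12, a12], [lam1 - a11, lam2 - a11]]  # eigenvectors as columns
    Lam = [[lam1, 0.0], [0.0, lam2]]
    return T, Lam

A = [[2.0, 1.0], [1.0, 2.0]]
T, Lam = eigen_2x2(A)
rebuilt = matmul(matmul(T, Lam), inverse_2x2(T))  # T Lam T^{-1}
print(Lam, rebuilt)
```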


The Jordan Decomposition 


The decomposition in [A.4.24] required the (n × n) matrix A to have n
linearly independent eigenvectors. This will be true whenever A has n distinct
eigenvalues, and could still be true even if A has some repeated eigenvalues. In
the completely general case when A has $s \leq n$ linearly independent eigenvectors,
there always exists a decomposition similar to [A.4.24], known as the
Jordan decomposition. Specifically, for such a matrix A there exists a nonsingular
(n × n) matrix M such that

$$A = MJM^{-1},$$ [A.4.25]
where the (n × n) matrix J takes the form
$$J = \begin{bmatrix} J_1 & 0 & \cdots & 0 \\ 0 & J_2 & \cdots & 0 \\ \vdots & \vdots & \cdots & \vdots \\ 0 & 0 & \cdots & J_s \end{bmatrix}$$ [A.4.26]
with
$$J_i = \begin{bmatrix} \lambda_i & 1 & 0 & \cdots & 0 \\ 0 & \lambda_i & 1 & \cdots & 0 \\ 0 & 0 & \lambda_i & \cdots & 0 \\ \vdots & \vdots & \vdots & \cdots & \vdots \\ 0 & 0 & 0 & \cdots & \lambda_i \end{bmatrix}.$$ [A.4.27]

Thus, $J_i$ has the eigenvalue $\lambda_i$ repeated along the principal diagonal and has unity
repeated along the diagonal above the principal diagonal. The same eigenvalue $\lambda_i$
can appear in two different Jordan blocks $J_1$ and $J_2$ if it corresponds to several
linearly independent eigenvectors.


Some Further Results on Eigenvalues 


Suppose that $\lambda$ is an eigenvalue of the (n × n) matrix A. Then $\lambda$ is also an
eigenvalue of $SAS^{-1}$ for any nonsingular (n × n) matrix S. To see this, note that

$$(A - \lambda I_n)x = 0$$
implies that
$$S(A - \lambda I_n)S^{-1}Sx = 0$$
or
$$(SAS^{-1} - \lambda I_n)x^* = 0$$ [A.4.28]

for $x^* = Sx$. Thus, $\lambda$ is an eigenvalue of $SAS^{-1}$ associated with the eigenvector
$x^*$.

From [A.4.25], this implies that the determinant of any (n × n) matrix A is
the same as the determinant of its Jordan matrix J defined in [A.4.26]. Since J is
upper triangular, its determinant is the product of terms along the principal diagonal,
which were just the eigenvalues of A. Thus, the determinant of any matrix
A is given by the product of its eigenvalues.

It is also clear that the eigenvalues of A are the same as those of A'. Taking
the transpose of [A.4.25],

$$A' = (M')^{-1}J'M',$$
we see that the eigenvalues of A' are the eigenvalues of J'. Since J' is lower
triangular, its eigenvalues are the elements on its principal diagonal. But J' and J
have the same principal diagonal, meaning that A' and A have the same eigenvalues.


Matrix Geometric Series 


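The results of [A.3.6] through [A.3.10] generalize readily to geometric series
involving square matrices. Consider the sum
$$\mathbf{S}_T = \mathbf{I}_n + \mathbf{A} + \mathbf{A}^2 + \mathbf{A}^3 + \cdots + \mathbf{A}^T$$ [A.4.29]
for $\mathbf{A}$ an ($n \times n$) matrix. Premultiplying both sides of [A.4.29] by $\mathbf{A}$, we see that
$$\mathbf{A}\mathbf{S}_T = \mathbf{A} + \mathbf{A}^2 + \mathbf{A}^3 + \cdots + \mathbf{A}^T + \mathbf{A}^{T+1}.$$ [A.4.30]
Subtracting [A.4.30] from [A.4.29], we find that
$$(\mathbf{I}_n - \mathbf{A})\mathbf{S}_T = \mathbf{I}_n - \mathbf{A}^{T+1}.$$ [A.4.31]

Notice from [A.4.18] that if $|\mathbf{I}_n - \mathbf{A}| = 0$, then $\lambda = 1$ would be an eigenvalue of
$\mathbf{A}$. Assuming that none of the eigenvalues of $\mathbf{A}$ is equal to unity, the matrix
$(\mathbf{I}_n - \mathbf{A})$ is nonsingular and [A.4.31] implies that
$$\mathbf{S}_T = (\mathbf{I}_n - \mathbf{A})^{-1}(\mathbf{I}_n - \mathbf{A}^{T+1})$$ [A.4.32]
if no eigenvalue of $\mathbf{A}$ equals 1. If all the eigenvalues of $\mathbf{A}$ are strictly less than 1
in modulus, it can be shown that $\mathbf{A}^{T+1} \to \mathbf{0}$ as $T \to \infty$, implying that
$$(\mathbf{I}_n + \mathbf{A} + \mathbf{A}^2 + \mathbf{A}^3 + \cdots) = (\mathbf{I}_n - \mathbf{A})^{-1}$$ [A.4.33]
assuming that the eigenvalues of $\mathbf{A}$ are all inside the unit circle.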
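
A quick numerical check of [A.4.33] may help fix ideas. The sketch below is only illustrative: the (2 x 2) matrix `A` and the helper names (`matmul`, `partial_sum`, `inverse_2x2`) are our own choices, not anything defined in the text. Both eigenvalues of `A` (roughly 0.57 and 0.23) lie inside the unit circle, so the partial sums should approach the inverse of (I - A).

```python
def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def matadd(X, Y):
    """Elementwise sum of two conformable matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def partial_sum(A, T):
    """S_T = I + A + A^2 + ... + A^T, as in [A.4.29]."""
    n = len(A)
    S, P = identity(n), identity(n)
    for _ in range(T):
        P = matmul(P, A)   # P now holds the next power of A
        S = matadd(S, P)
    return S

def inverse_2x2(M):
    """Closed-form inverse of a nonsingular (2 x 2) matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Hypothetical example matrix with eigenvalues inside the unit circle.
A = [[0.5, 0.2], [0.1, 0.3]]
I_minus_A = [[1 - 0.5, -0.2], [-0.1, 1 - 0.3]]

limit = inverse_2x2(I_minus_A)   # right side of [A.4.33]
S_200 = partial_sum(A, 200)      # left side, truncated at T = 200

max_gap = max(abs(s - l) for rs, rl in zip(S_200, limit) for s, l in zip(rs, rl))
print(max_gap)
```

The printed gap should be at the level of floating-point rounding error.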


Kronecker Products 


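For $\mathbf{A}$ an ($m \times n$) matrix and $\mathbf{B}$ a ($p \times q$) matrix, the Kronecker product of
$\mathbf{A}$ and $\mathbf{B}$ is defined as the following ($mp$) $\times$ ($nq$) matrix:
$$\mathbf{A} \otimes \mathbf{B} = \begin{bmatrix} a_{11}\mathbf{B} & a_{12}\mathbf{B} & \cdots & a_{1n}\mathbf{B} \\ a_{21}\mathbf{B} & a_{22}\mathbf{B} & \cdots & a_{2n}\mathbf{B} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1}\mathbf{B} & a_{m2}\mathbf{B} & \cdots & a_{mn}\mathbf{B} \end{bmatrix}.$$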


The following properties of the Kronecker product are readily verified. For any 
matrices A, B, and C, 


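$$(\mathbf{A} \otimes \mathbf{B})' = \mathbf{A}' \otimes \mathbf{B}'$$ [A.4.34]
$$(\mathbf{A} \otimes \mathbf{B}) \otimes \mathbf{C} = \mathbf{A} \otimes (\mathbf{B} \otimes \mathbf{C}).$$ [A.4.35]
Also, for $\mathbf{A}$ and $\mathbf{B}$ both ($m \times n$) matrices and $\mathbf{C}$ any matrix,
$$(\mathbf{A} + \mathbf{B}) \otimes \mathbf{C} = (\mathbf{A} \otimes \mathbf{C}) + (\mathbf{B} \otimes \mathbf{C})$$ [A.4.36]
$$\mathbf{C} \otimes (\mathbf{A} + \mathbf{B}) = (\mathbf{C} \otimes \mathbf{A}) + (\mathbf{C} \otimes \mathbf{B}).$$ [A.4.37]

Let $\mathbf{A}$ be ($m \times n$), $\mathbf{B}$ be ($p \times q$), $\mathbf{C}$ be ($n \times k$), and $\mathbf{D}$ be ($q \times r$). Then
$$(\mathbf{A} \otimes \mathbf{B})(\mathbf{C} \otimes \mathbf{D}) = (\mathbf{A}\mathbf{C}) \otimes (\mathbf{B}\mathbf{D});$$ [A.4.38]
that is,
$$\begin{bmatrix} a_{11}\mathbf{B} & a_{12}\mathbf{B} & \cdots & a_{1n}\mathbf{B} \\ a_{21}\mathbf{B} & a_{22}\mathbf{B} & \cdots & a_{2n}\mathbf{B} \\ \vdots & \vdots & & \vdots \\ a_{m1}\mathbf{B} & a_{m2}\mathbf{B} & \cdots & a_{mn}\mathbf{B} \end{bmatrix} \begin{bmatrix} c_{11}\mathbf{D} & c_{12}\mathbf{D} & \cdots & c_{1k}\mathbf{D} \\ c_{21}\mathbf{D} & c_{22}\mathbf{D} & \cdots & c_{2k}\mathbf{D} \\ \vdots & \vdots & & \vdots \\ c_{n1}\mathbf{D} & c_{n2}\mathbf{D} & \cdots & c_{nk}\mathbf{D} \end{bmatrix} = \begin{bmatrix} \sum_j a_{1j}c_{j1}\mathbf{B}\mathbf{D} & \sum_j a_{1j}c_{j2}\mathbf{B}\mathbf{D} & \cdots & \sum_j a_{1j}c_{jk}\mathbf{B}\mathbf{D} \\ \sum_j a_{2j}c_{j1}\mathbf{B}\mathbf{D} & \sum_j a_{2j}c_{j2}\mathbf{B}\mathbf{D} & \cdots & \sum_j a_{2j}c_{jk}\mathbf{B}\mathbf{D} \\ \vdots & \vdots & & \vdots \\ \sum_j a_{mj}c_{j1}\mathbf{B}\mathbf{D} & \sum_j a_{mj}c_{j2}\mathbf{B}\mathbf{D} & \cdots & \sum_j a_{mj}c_{jk}\mathbf{B}\mathbf{D} \end{bmatrix},$$
where each sum runs over $j = 1, 2, \ldots, n$.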


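For $\mathbf{A}$ ($n \times n$) and $\mathbf{B}$ ($p \times p$) both nonsingular matrices we can set $\mathbf{C} = \mathbf{A}^{-1}$
and $\mathbf{D} = \mathbf{B}^{-1}$ in [A.4.38] to deduce that
$$(\mathbf{A} \otimes \mathbf{B})(\mathbf{A}^{-1} \otimes \mathbf{B}^{-1}) = (\mathbf{A}\mathbf{A}^{-1}) \otimes (\mathbf{B}\mathbf{B}^{-1}) = \mathbf{I}_n \otimes \mathbf{I}_p = \mathbf{I}_{np}.$$
Thus,
$$(\mathbf{A} \otimes \mathbf{B})^{-1} = (\mathbf{A}^{-1} \otimes \mathbf{B}^{-1}).$$ [A.4.39]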
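
The transpose and product rules for Kronecker products can be checked mechanically on small integer matrices, where the arithmetic is exact. The following is a minimal sketch; the helper `kron` and the particular matrices are hypothetical illustrations, chosen so that A is (2 x 3), B and D are (2 x 2), and C is (3 x 2), which makes every product below conformable.

```python
def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def kron(A, B):
    """Kronecker product: block (i, j) of the result is a_ij * B."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

# Hypothetical example matrices (integer entries keep the check exact).
A = [[1, 2, 0], [3, -1, 4]]      # (2 x 3)
B = [[2, 1], [0, 5]]             # (2 x 2)
C = [[1, 0], [2, 1], [-1, 3]]    # (3 x 2)
D = [[4, -2], [1, 1]]            # (2 x 2)

# Product rule: (A kron B)(C kron D) = (AC) kron (BD)
product_rule_holds = matmul(kron(A, B), kron(C, D)) == kron(matmul(A, C), matmul(B, D))

# Transpose rule: (A kron B)' = A' kron B'
transpose_rule_holds = transpose(kron(A, B)) == kron(transpose(A), transpose(B))

print(product_rule_holds, transpose_rule_holds)
```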


Eigenvalues of a Kronecker Product 


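For $\mathbf{A}$ an ($n \times n$) matrix with (possibly nondistinct) eigenvalues ($\lambda_1, \lambda_2, \ldots, \lambda_n$)
and $\mathbf{B}$ ($p \times p$) with eigenvalues ($\mu_1, \mu_2, \ldots, \mu_p$), the ($np$) eigenvalues
of $\mathbf{A} \otimes \mathbf{B}$ are given by $\lambda_i\mu_j$ for $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, p$. To see
this, write $\mathbf{A}$ and $\mathbf{B}$ in Jordan form as
$$\mathbf{A} = \mathbf{M}_A\mathbf{J}_A\mathbf{M}_A^{-1}$$
$$\mathbf{B} = \mathbf{M}_B\mathbf{J}_B\mathbf{M}_B^{-1}.$$
Then $(\mathbf{M}_A \otimes \mathbf{M}_B)$ has inverse given by $(\mathbf{M}_A^{-1} \otimes \mathbf{M}_B^{-1})$. Moreover, we know from
[A.4.28] that the eigenvalues of $(\mathbf{A} \otimes \mathbf{B})$ are the same as the eigenvalues of
$$(\mathbf{M}_A^{-1} \otimes \mathbf{M}_B^{-1})(\mathbf{A} \otimes \mathbf{B})(\mathbf{M}_A \otimes \mathbf{M}_B) = (\mathbf{M}_A^{-1}\mathbf{A}\mathbf{M}_A) \otimes (\mathbf{M}_B^{-1}\mathbf{B}\mathbf{M}_B) = \mathbf{J}_A \otimes \mathbf{J}_B.$$
But $\mathbf{J}_A$ and $\mathbf{J}_B$ are both upper triangular, meaning that $(\mathbf{J}_A \otimes \mathbf{J}_B)$ is upper triangular
as well. The eigenvalues of $(\mathbf{A} \otimes \mathbf{B})$ are thus just the terms on the principal diagonal
of $(\mathbf{J}_A \otimes \mathbf{J}_B)$, which are given by $\lambda_i\mu_j$.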


Positive Definite Matrices 


An ($n \times n$) real symmetric matrix $\mathbf{A}$ is said to be positive semidefinite if for
any real ($n \times 1$) vector $\mathbf{x}$,
$$\mathbf{x}'\mathbf{A}\mathbf{x} \ge 0.$$
We make the stronger statement that a real symmetric matrix $\mathbf{A}$ is positive definite
if for any real nonzero ($n \times 1$) vector $\mathbf{x}$,
$$\mathbf{x}'\mathbf{A}\mathbf{x} > 0;$$
hence, any positive definite matrix could also be said to be positive semidefinite.


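Let $\lambda$ be an eigenvalue of $\mathbf{A}$ associated with the eigenvector $\mathbf{x}$:
$$\mathbf{A}\mathbf{x} = \lambda\mathbf{x}.$$
Premultiplying this equation by $\mathbf{x}'$ results in
$$\mathbf{x}'\mathbf{A}\mathbf{x} = \lambda\mathbf{x}'\mathbf{x}.$$
Since an eigenvector $\mathbf{x}$ cannot be the zero vector, $\mathbf{x}'\mathbf{x} > 0$. Thus, for a positive
semidefinite matrix $\mathbf{A}$, any eigenvalue $\lambda$ of $\mathbf{A}$ must be greater than or equal to
zero. For $\mathbf{A}$ positive definite, all eigenvalues are strictly greater than zero. Since
the determinant of $\mathbf{A}$ is the product of the eigenvalues, the determinant of a positive
definite matrix $\mathbf{A}$ is strictly positive.

Let $\mathbf{A}$ be a positive definite ($n \times n$) matrix and let $\mathbf{B}$ denote a nonsingular
($n \times n$) matrix. Then $\mathbf{B}'\mathbf{A}\mathbf{B}$ is positive definite. To see this, let $\mathbf{x}$ be any nonzero
vector. Define
$$\tilde{\mathbf{x}} = \mathbf{B}\mathbf{x}.$$
Then $\tilde{\mathbf{x}}$ cannot be the zero vector, for if it were, this equation would state that
there exists a nonzero vector $\mathbf{x}$ such that
$$\mathbf{B}\mathbf{x} = 0 \cdot \mathbf{x},$$
in which case zero would be an eigenvalue of $\mathbf{B}$ associated with the eigenvector $\mathbf{x}$.
But since $\mathbf{B}$ is nonsingular, none of its eigenvalues can be zero. Thus, $\tilde{\mathbf{x}} = \mathbf{B}\mathbf{x}$
cannot be the zero vector, and
$$\mathbf{x}'\mathbf{B}'\mathbf{A}\mathbf{B}\mathbf{x} = \tilde{\mathbf{x}}'\mathbf{A}\tilde{\mathbf{x}} > 0,$$
establishing that the matrix $\mathbf{B}'\mathbf{A}\mathbf{B}$ is positive definite.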

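A special case of this result is obtained by letting $\mathbf{A}$ be the identity matrix.
Then the result implies that any matrix that can be written as $\mathbf{B}'\mathbf{B}$ for some
nonsingular matrix $\mathbf{B}$ is positive definite. More generally, any matrix that can be written
as $\mathbf{B}'\mathbf{B}$ for an arbitrary matrix $\mathbf{B}$ must be positive semidefinite:
$$\mathbf{x}'\mathbf{B}'\mathbf{B}\mathbf{x} = \tilde{\mathbf{x}}'\tilde{\mathbf{x}} = \tilde{x}_1^2 + \tilde{x}_2^2 + \cdots + \tilde{x}_n^2 \ge 0,$$ [A.4.40]
where $\tilde{\mathbf{x}} = \mathbf{B}\mathbf{x}$.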

The converse propositions are also true: if A is positive semidefinite, then 
there exists a matrix B such that A = B’B; if A is positive definite, then there 
exists a nonsingular matrix B such that A = B’B. A proof of this claim and an 
algorithm for calculating B are provided in Section 4.4. 


Conjugate Transposes 


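Let $\mathbf{A}$ denote an ($m \times n$) matrix of (possibly) complex numbers:
$$\mathbf{A} = \begin{bmatrix} a_{11} + b_{11}i & a_{12} + b_{12}i & \cdots & a_{1n} + b_{1n}i \\ a_{21} + b_{21}i & a_{22} + b_{22}i & \cdots & a_{2n} + b_{2n}i \\ \vdots & \vdots & & \vdots \\ a_{m1} + b_{m1}i & a_{m2} + b_{m2}i & \cdots & a_{mn} + b_{mn}i \end{bmatrix}.$$
The conjugate transpose of $\mathbf{A}$, denoted $\mathbf{A}^H$, is formed by transposing $\mathbf{A}$ and replacing
each element with its complex conjugate:
$$\mathbf{A}^H = \begin{bmatrix} a_{11} - b_{11}i & a_{21} - b_{21}i & \cdots & a_{m1} - b_{m1}i \\ a_{12} - b_{12}i & a_{22} - b_{22}i & \cdots & a_{m2} - b_{m2}i \\ \vdots & \vdots & & \vdots \\ a_{1n} - b_{1n}i & a_{2n} - b_{2n}i & \cdots & a_{mn} - b_{mn}i \end{bmatrix}.$$
Thus, if $\mathbf{A}$ is real, then $\mathbf{A}^H$ and $\mathbf{A}'$ would denote the same matrix.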


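Notice that if an ($n \times 1$) complex vector is premultiplied by its conjugate
transpose, the result is a nonnegative real scalar:
$$\mathbf{x}^H\mathbf{x} = \begin{bmatrix} (a_1 - b_1 i) & (a_2 - b_2 i) & \cdots & (a_n - b_n i) \end{bmatrix} \begin{bmatrix} a_1 + b_1 i \\ a_2 + b_2 i \\ \vdots \\ a_n + b_n i \end{bmatrix} = \sum_{i=1}^{n} (a_i^2 + b_i^2) \ge 0.$$
For $\mathbf{B}$ a real ($m \times n$) matrix and $\mathbf{x}$ a complex ($n \times 1$) vector,
$$(\mathbf{B}\mathbf{x})^H = \mathbf{x}^H\mathbf{B}'.$$
More generally, if both $\mathbf{B}$ and $\mathbf{x}$ are complex,
$$(\mathbf{B}\mathbf{x})^H = \mathbf{x}^H\mathbf{B}^H.$$
Notice that if $\mathbf{A}$ is positive semidefinite, then
$$\mathbf{x}^H\mathbf{A}\mathbf{x} = \mathbf{x}^H\mathbf{B}'\mathbf{B}\mathbf{x} = \tilde{\mathbf{x}}^H\tilde{\mathbf{x}},$$
with $\tilde{\mathbf{x}} = \mathbf{B}\mathbf{x}$. Thus, $\mathbf{x}^H\mathbf{A}\mathbf{x}$ is a nonnegative real scalar for any $\mathbf{x}$ when $\mathbf{A}$ is positive
semidefinite. It is a positive real scalar for $\mathbf{A}$ positive definite and $\mathbf{x}$ nonzero.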


Continuity of Functions of Vectors 


A function of more than one argument, such as
$$y = f(x_1, x_2, \ldots, x_n),$$ [A.4.41]
is said to be continuous at $(c_1, c_2, \ldots, c_n)$ if $f(c_1, c_2, \ldots, c_n)$ is finite and for
every $\varepsilon > 0$ there is a $\delta > 0$ such that
$$|f(x_1, x_2, \ldots, x_n) - f(c_1, c_2, \ldots, c_n)| < \varepsilon$$
whenever
$$(x_1 - c_1)^2 + (x_2 - c_2)^2 + \cdots + (x_n - c_n)^2 < \delta.$$


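Partial Derivatives

The partial derivative of $f$ with respect to $x_i$ is defined by
$$\frac{\partial f}{\partial x_i} = \lim_{\Delta \to 0} \Delta^{-1} \cdot \{ f(x_1, x_2, \ldots, x_{i-1}, x_i + \Delta, x_{i+1}, \ldots, x_n) - f(x_1, x_2, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n) \}.$$ [A.4.42]


Gradient

If we collect the $n$ partial derivatives in [A.4.42] in a vector, we obtain the
gradient of the function $f$, denoted $\nabla$:
$$\underset{(n \times 1)}{\nabla} = \begin{bmatrix} \partial f/\partial x_1 \\ \partial f/\partial x_2 \\ \vdots \\ \partial f/\partial x_n \end{bmatrix}.$$ [A.4.43]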

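For example, suppose $f$ is a linear function:
$$f(x_1, x_2, \ldots, x_n) = a_1x_1 + a_2x_2 + \cdots + a_nx_n.$$ [A.4.44]
Define $\mathbf{a}$ and $\mathbf{x}$ to be the following ($n \times 1$) vectors:
$$\mathbf{a} = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix}$$ [A.4.45]
$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}.$$ [A.4.46]
Then [A.4.44] can be written
$$f(\mathbf{x}) = \mathbf{a}'\mathbf{x}.$$
The partial derivative of $f(\cdot)$ with respect to the $i$th argument is
$$\frac{\partial f}{\partial x_i} = a_i,$$
and the gradient is
$$\nabla = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix} = \mathbf{a}.$$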


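Second-Order Derivatives

A second-order derivative of [A.4.41] is given by
$$\frac{\partial^2 f(x_1, \ldots, x_n)}{\partial x_i \, \partial x_j} = \frac{\partial}{\partial x_i}\left[\frac{\partial f(x_1, \ldots, x_n)}{\partial x_j}\right].$$
Where second-order derivatives exist and are continuous for all $i$ and $j$, the order
of differentiation is irrelevant:
$$\frac{\partial}{\partial x_i}\left[\frac{\partial f(x_1, \ldots, x_n)}{\partial x_j}\right] = \frac{\partial}{\partial x_j}\left[\frac{\partial f(x_1, \ldots, x_n)}{\partial x_i}\right].$$
Sometimes these second-order derivatives are collected in an $n \times n$ matrix $\mathbf{H}$ called
the Hessian matrix:
$$\mathbf{H} = \left[\frac{\partial^2 f}{\partial x_i \, \partial x_j}\right].$$
We will also use the notation
$$\frac{\partial^2 f}{\partial \mathbf{x} \, \partial \mathbf{x}'}$$
to represent the matrix $\mathbf{H}$.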


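Derivatives of Vector-Valued Functions

Suppose we have a set of $m$ functions $f_1(\cdot), f_2(\cdot), \ldots, f_m(\cdot)$, each of which
depends on the $n$ variables $(x_1, x_2, \ldots, x_n)$. We can collect the $m$ functions into
a single vector-valued function:
$$\underset{(m \times 1)}{\mathbf{f}(\mathbf{x})} = \begin{bmatrix} f_1(\mathbf{x}) \\ f_2(\mathbf{x}) \\ \vdots \\ f_m(\mathbf{x}) \end{bmatrix}.$$
We sometimes write
$$\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$$
to indicate that the function takes $n$ different real numbers (summarized by the
vector $\mathbf{x}$, an element of $\mathbb{R}^n$) and calculates $m$ different new numbers (summarized
by the value of $\mathbf{f}$, an element of $\mathbb{R}^m$). Suppose that each of the functions $f_1(\cdot)$,
$f_2(\cdot), \ldots, f_m(\cdot)$ has derivatives with respect to each of the arguments $x_1, x_2,
\ldots, x_n$. We can summarize these derivatives in an ($m \times n$) matrix, called the
Jacobian matrix of $\mathbf{f}$ and indicated by $\partial\mathbf{f}/\partial\mathbf{x}'$:
$$\underset{(m \times n)}{\frac{\partial \mathbf{f}}{\partial \mathbf{x}'}} = \begin{bmatrix} \partial f_1/\partial x_1 & \partial f_1/\partial x_2 & \cdots & \partial f_1/\partial x_n \\ \partial f_2/\partial x_1 & \partial f_2/\partial x_2 & \cdots & \partial f_2/\partial x_n \\ \vdots & \vdots & & \vdots \\ \partial f_m/\partial x_1 & \partial f_m/\partial x_2 & \cdots & \partial f_m/\partial x_n \end{bmatrix}.$$
For example, suppose that each of the functions $f_i(\mathbf{x})$ is linear:
$$f_1(\mathbf{x}) = a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n$$
$$f_2(\mathbf{x}) = a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n$$
$$\vdots$$
$$f_m(\mathbf{x}) = a_{m1}x_1 + a_{m2}x_2 + \cdots + a_{mn}x_n.$$
We could write this system in matrix form as
$$\mathbf{f}(\mathbf{x}) = \mathbf{A}\mathbf{x},$$
where
$$\underset{(m \times n)}{\mathbf{A}} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}$$
and $\mathbf{x}$ is the ($n \times 1$) vector defined in [A.4.46]. Then
$$\frac{\partial \mathbf{f}}{\partial \mathbf{x}'} = \mathbf{A}.$$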
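
The claim that the Jacobian of a linear system f(x) = Ax is A itself can be illustrated with finite differences. This is a rough sketch under our own assumptions: the function `jacobian`, the step size `h`, and the (2 x 3) matrix `A` are hypothetical choices for illustration, and a forward difference is only one of several ways to approximate a derivative.

```python
def jacobian(f, x, h=1e-6):
    """Forward-difference approximation to the (m x n) matrix of df_i/dx_j."""
    fx = f(x)
    columns = []
    for j in range(len(x)):
        x_step = list(x)
        x_step[j] += h                     # perturb only the jth argument
        f_step = f(x_step)
        columns.append([(fs - f0) / h for fs, f0 in zip(f_step, fx)])
    return [list(row) for row in zip(*columns)]   # columns -> rows

# Hypothetical coefficient matrix for the linear system f(x) = A x.
A = [[1.0, 2.0, 3.0],
     [0.0, -1.0, 4.0]]

def linear(x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

J = jacobian(linear, [0.5, -1.0, 2.0])
max_err = max(abs(J[i][j] - A[i][j]) for i in range(2) for j in range(3))
print(J)
```

Because the function is linear, the approximation error here comes only from floating-point rounding; for a nonlinear function the forward difference would also carry a truncation error of order `h`.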


Taylor's Theorem with Multiple Arguments 
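Let $f: \mathbb{R}^n \to \mathbb{R}^1$ as in [A.4.41], with continuous second derivatives. A first-order
Taylor series expansion of $f(\mathbf{x})$ around $\mathbf{c}$ is given by
$$f(\mathbf{x}) = f(\mathbf{c}) + \left.\frac{\partial f}{\partial \mathbf{x}'}\right|_{\mathbf{x}=\mathbf{c}} \cdot (\mathbf{x} - \mathbf{c}) + R_1(\mathbf{c}, \mathbf{x}).$$ [A.4.47]
Here $\partial f/\partial\mathbf{x}'$ denotes the ($1 \times n$) vector that is the transpose of the gradient, and
the remainder $R_1(\cdot)$ satisfies
$$R_1(\mathbf{c}, \mathbf{x}) = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \left.\frac{\partial^2 f}{\partial x_i \, \partial x_j}\right|_{\mathbf{x}=\boldsymbol{\delta}(i,j)} \cdot (x_i - c_i)(x_j - c_j)$$
for $\boldsymbol{\delta}(i, j)$ an ($n \times 1$) vector, potentially different for each $i$ and $j$, with each
$\boldsymbol{\delta}(i, j)$ between $\mathbf{c}$ and $\mathbf{x}$, that is, $\boldsymbol{\delta}(i, j) = \lambda(i, j)\mathbf{c} + [1 - \lambda(i, j)]\mathbf{x}$ for some
$\lambda(i, j)$ between 0 and 1. Furthermore,
$$\lim_{\mathbf{x} \to \mathbf{c}} \frac{R_1(\mathbf{c}, \mathbf{x})}{[(\mathbf{x} - \mathbf{c})'(\mathbf{x} - \mathbf{c})]^{1/2}} = 0.$$

An implication of [A.4.47] is that if we wish to approximate the consequences
for $f$ of simultaneously changing $x_1$ by $\Delta_1$, $x_2$ by $\Delta_2$, ..., and $x_n$ by $\Delta_n$, we could
use
$$f(x_1 + \Delta_1, x_2 + \Delta_2, \ldots, x_n + \Delta_n) - f(x_1, x_2, \ldots, x_n) \cong \frac{\partial f}{\partial x_1} \cdot \Delta_1 + \frac{\partial f}{\partial x_2} \cdot \Delta_2 + \cdots + \frac{\partial f}{\partial x_n} \cdot \Delta_n.$$ [A.4.48]

If $f(\cdot)$ has continuous third derivatives, a second-order Taylor series expansion
of $f(\mathbf{x})$ around $\mathbf{c}$ is given by
$$f(\mathbf{x}) = f(\mathbf{c}) + \left.\frac{\partial f}{\partial \mathbf{x}'}\right|_{\mathbf{x}=\mathbf{c}} \cdot (\mathbf{x} - \mathbf{c}) + \frac{1}{2}(\mathbf{x} - \mathbf{c})' \left.\frac{\partial^2 f}{\partial \mathbf{x} \, \partial \mathbf{x}'}\right|_{\mathbf{x}=\mathbf{c}} (\mathbf{x} - \mathbf{c}) + R_2(\mathbf{c}, \mathbf{x}),$$ [A.4.49]
where
$$R_2(\mathbf{c}, \mathbf{x}) = \frac{1}{3!}\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k=1}^{n} \left.\frac{\partial^3 f}{\partial x_i \, \partial x_j \, \partial x_k}\right|_{\mathbf{x}=\boldsymbol{\delta}(i,j,k)} \cdot (x_i - c_i)(x_j - c_j)(x_k - c_k)$$
with $\boldsymbol{\delta}(i, j, k)$ between $\mathbf{c}$ and $\mathbf{x}$ and
$$\lim_{\mathbf{x} \to \mathbf{c}} \frac{R_2(\mathbf{c}, \mathbf{x})}{(\mathbf{x} - \mathbf{c})'(\mathbf{x} - \mathbf{c})} = 0.$$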


Multiple Integrals

The notation
$$\int_a^b \int_c^d f(x, y) \, dy \, dx$$
indicates the following operation: first integrate
$$\int_c^d f(x, y) \, dy$$
with respect to $y$, with $x$ held fixed, and then integrate the resulting function with
respect to $x$. For example,
$$\int_0^1 \int_0^2 x^4 y \, dy \, dx = \int_0^1 x^4[(2^2/2) - (0^2/2)] \, dx = 2[1^5/5 - 0^5/5] = 2/5.$$
Provided that $f(x, y)$ is continuous, the order of integration can be reversed. For
example,
$$\int_0^2 \int_0^1 x^4 y \, dx \, dy = \int_0^2 (1^5/5) y \, dy = (1/5) \cdot (2^2/2) = 2/5.$$

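
The double integral of x^4 y over 0 <= x <= 1 and 0 <= y <= 2 can also be approximated numerically in either order. The sketch below uses a simple midpoint rule; the helper name `midpoint_integral` and the grid size are our own illustrative choices, not part of the text.

```python
def midpoint_integral(g, lo, hi, n=400):
    """Midpoint-rule approximation to the integral of g over [lo, hi]."""
    width = (hi - lo) / n
    return width * sum(g(lo + (k + 0.5) * width) for k in range(n))

def f(x, y):
    return x ** 4 * y

# Integrate over y first (x held fixed), then over x.
dy_then_dx = midpoint_integral(
    lambda x: midpoint_integral(lambda y: f(x, y), 0.0, 2.0), 0.0, 1.0)

# Reverse the order: integrate over x first, then over y.
dx_then_dy = midpoint_integral(
    lambda y: midpoint_integral(lambda x: f(x, y), 0.0, 1.0), 0.0, 2.0)

print(dy_then_dx, dx_then_dy)
```

Both approximations should be close to the exact value 2/5, with a small discretization error from the midpoint rule.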
A.5. Probability and Statistics 


Densities and Distributions 


A stochastic or random variable $X$ is said to be discrete-valued if it can take
on only one of $K$ particular values; call these $x_1, x_2, \ldots, x_K$. Its probability
distribution is a set of numbers that give the probability of each outcome:
$$P\{X = x_k\} = \text{probability that } X \text{ takes on the value } x_k, \qquad k = 1, \ldots, K.$$
The probabilities sum to unity:
$$\sum_{k=1}^{K} P\{X = x_k\} = 1.$$
Assuming that the possible outcomes are ordered $x_1 < x_2 < \cdots < x_K$, the
probability that $X$ takes on a value less than or equal to the value $x_j$ is given by
$$P\{X \le x_j\} = \sum_{k=1}^{j} P\{X = x_k\}.$$
If $X$ is equal to a constant $c$ with probability 1, then $X$ is nonstochastic.

The probability law for a continuous-valued random variable $X$ can often be
described by the density function $f_X(x)$ with
$$\int_{-\infty}^{\infty} f_X(x) \, dx = 1.$$ [A.5.1]
The subscript $X$ in $f_X(x)$ indicates that this is the density of the random variable
$X$; the argument $x$ of $f_X(x)$ indexes the integration in [A.5.1]. The cumulative
distribution function of $X$ (denoted $F_X(a)$) gives the probability that $X$ takes on a
value less than or equal to $a$:
$$F_X(a) = P\{X \le a\} = \int_{-\infty}^{a} f_X(x) \, dx.$$ [A.5.2]


Population Moments 


The population mean $\mu$ of a continuous-valued random variable $X$ is given
by
$$\mu = \int_{-\infty}^{\infty} x \cdot f_X(x) \, dx,$$
provided this integral exists. (In the formulas that follow, we assume for simplicity
of exposition that the density functions are continuous and that the indicated
integrals all exist.) The population variance is
$$\text{Var}(X) = \int_{-\infty}^{\infty} (x - \mu)^2 \cdot f_X(x) \, dx.$$
The square root of the variance is called the population standard deviation.
In general, the $r$th population moment is given by
$$\int_{-\infty}^{\infty} x^r \cdot f_X(x) \, dx.$$
The population mean could thus be described as the first population moment.


Expectation 
The population mean $\mu$ is also called the expectation of $X$, denoted $E(X)$ or
sometimes simply $EX$. In general, the expectation of a function $g(X)$ is given by
$$E(g(X)) = \int_{-\infty}^{\infty} g(x) \cdot f_X(x) \, dx,$$ [A.5.3]
where $f_X(x)$ is the density of $X$. For example, the $r$th population moment of $X$ is
the expectation of $X^r$.

Consider the random variable $a + bX$ for constants $a$ and $b$. Its expectation
is
$$E(a + bX) = \int_{-\infty}^{\infty} [a + bx] \cdot f_X(x) \, dx = a\int_{-\infty}^{\infty} f_X(x) \, dx + b\int_{-\infty}^{\infty} x \cdot f_X(x) \, dx = a + b \cdot E(X).$$
The variance of $a + bX$ is
$$\text{Var}(a + bX) = \int_{-\infty}^{\infty} [(a + bx) - (a + b\mu)]^2 \cdot f_X(x) \, dx = b^2 \int_{-\infty}^{\infty} (x - \mu)^2 \cdot f_X(x) \, dx = b^2 \cdot \text{Var}(X).$$ [A.5.4]
Another useful result is
$$E(X^2) = E[(X - \mu + \mu)^2] = E[(X - \mu)^2 + 2\mu(X - \mu) + \mu^2] = E[(X - \mu)^2] + 2\mu \cdot [E(X) - \mu] + \mu^2 = \text{Var}(X) + 0 + [E(X)]^2.$$

To simplify the appearance of expressions, we adopt the convention that
exponentiation and multiplication are carried out before the expectation
operator. Thus, we will use $E(X - \mu + \mu)^2$ to indicate the same operation as
$E[(X - \mu + \mu)^2]$. The square of $E(X - \mu + \mu)$ is indicated by using additional
parentheses, as $[E(X - \mu + \mu)]^2$.
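
The identities for the mean and variance of a + bX, and for the second moment of X, can be verified exactly on a small example. The text states these results for continuous densities; the sketch below assumes instead a hypothetical three-point discrete distribution, so that sums replace integrals and `fractions.Fraction` keeps the arithmetic exact.

```python
from fractions import Fraction

# Hypothetical discrete distribution: X takes the values 0, 1, 2.
values = [0, 1, 2]
probs = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]

def expect(g):
    """E[g(X)] as a probability-weighted sum (discrete analog of [A.5.3])."""
    return sum(p * g(x) for x, p in zip(values, probs))

mu = expect(lambda x: x)
var = expect(lambda x: (x - mu) ** 2)
second_moment = expect(lambda x: x ** 2)

a, b = 3, 2
mean_lin = expect(lambda x: a + b * x)
var_lin = expect(lambda x: (a + b * x - mean_lin) ** 2)

print(mu, var, second_moment, mean_lin, var_lin)
```

Here `second_moment` equals `var + mu**2`, `mean_lin` equals `a + b*mu`, and `var_lin` equals `b**2 * var`, matching the three results derived above.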


Sample Moments 


A sample moment is a particular estimate of a population moment based on
an observed set of data, say, $\{x_1, x_2, \ldots, x_T\}$. The first sample moment is the
sample mean,
$$\bar{x} = (1/T) \cdot (x_1 + x_2 + \cdots + x_T),$$
which is a natural estimate of the population mean $\mu$. The sample variance,
$$s^2 = (1/T) \cdot \sum_{t=1}^{T} (x_t - \bar{x})^2,$$
affords an estimate of the population variance $\sigma^2$. More generally, the $r$th sample
moment is given by
$$(1/T) \cdot (x_1^r + x_2^r + \cdots + x_T^r),$$
where $x_t^r$ denotes $x_t$ raised to the $r$th power.


Bias and Efficiency 


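Let $\hat{\boldsymbol{\theta}}$ be a sample estimate of a vector of population parameters $\boldsymbol{\theta}$. For
example, $\hat{\boldsymbol{\theta}}$ could be the sample mean $\bar{x}$ and $\boldsymbol{\theta}$ the population mean $\mu$. The estimate
is said to be unbiased if $E(\hat{\boldsymbol{\theta}}) = \boldsymbol{\theta}$.

Suppose that $\hat{\boldsymbol{\theta}}$ is an unbiased estimate of $\boldsymbol{\theta}$. The estimate $\hat{\boldsymbol{\theta}}$ is said to be
efficient if it is the case that for any other unbiased estimate $\hat{\boldsymbol{\theta}}^*$, the following
matrix is positive semidefinite:
$$\mathbf{P} = E[(\hat{\boldsymbol{\theta}}^* - \boldsymbol{\theta}) \cdot (\hat{\boldsymbol{\theta}}^* - \boldsymbol{\theta})'] - E[(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}) \cdot (\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})'].$$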


Joint Distributions 


For two random variables $X$ and $Y$ with the joint density $f_{X,Y}(x, y)$, we
calculate the probability of the joint event that both $X \le a$ and $Y \le b$ from
$$P\{X \le a, Y \le b\} = \int_{-\infty}^{a} \int_{-\infty}^{b} f_{X,Y}(x, y) \, dy \, dx.$$
This can be represented in terms of the joint cumulative distribution function:
$$F_{X,Y}(a, b) = P\{X \le a, Y \le b\}.$$
The probability that $X \le a$ by itself can be calculated from
$$P\{X \le a, Y \text{ any}\} = \int_{-\infty}^{a} \left[\int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dy\right] dx.$$ [A.5.5]
Comparison of [A.5.5] with [A.5.2] reveals that the marginal density $f_X(x)$ is
obtained by integrating the joint density $f_{X,Y}(x, y)$ with respect to $y$:
$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dy.$$ [A.5.6]


Conditional Distributions 
The conditional density of $Y$ given $X$ is given by
$$f_{Y|X}(y|x) = \begin{cases} \dfrac{f_{X,Y}(x, y)}{f_X(x)} & \text{if } f_X(x) > 0 \\ 0 & \text{otherwise.} \end{cases}$$ [A.5.7]
Notice that this satisfies the requirement of a density [A.5.1]:
$$\int_{-\infty}^{\infty} f_{Y|X}(y|x) \, dy = \int_{-\infty}^{\infty} \frac{f_{X,Y}(x, y)}{f_X(x)} \, dy = \frac{1}{f_X(x)} \int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dy = \frac{f_X(x)}{f_X(x)} = 1.$$
A further obvious implication of the definition in [A.5.7] is that a joint density
can be written as the product of the marginal density and the conditional density:
$$f_{X,Y}(x, y) = f_{Y|X}(y|x) \cdot f_X(x).$$ [A.5.8]
The conditional expectation of $Y$ given that the random variable $X$ takes on
the particular value $x$ is
$$E(Y|X = x) = \int_{-\infty}^{\infty} y \cdot f_{Y|X}(y|x) \, dy.$$ [A.5.9]


Law of Iterated Expectations 


Note that the conditional expectation is a function of the value of the random 
variable X. For different realizations of X, the conditional expectation will be a 
different number. Suppose we view E(Y|X) as a random variable and take its 
expectation with respect to the distribution of X: 


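$$E_X[E_{Y|X}(Y|X)] = \int_{-\infty}^{\infty} \left[\int_{-\infty}^{\infty} y \cdot f_{Y|X}(y|x) \, dy\right] f_X(x) \, dx.$$
Results [A.5.8] and [A.5.6] can be used to express this expectation as
$$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} y \cdot f_{Y,X}(y, x) \, dy \, dx = \int_{-\infty}^{\infty} y \cdot f_Y(y) \, dy.$$
Thus,
$$E_X[E_{Y|X}(Y|X)] = E_Y(Y).$$ [A.5.10]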


In words, the random variable E(Y|X) has the same expectation as the random 
variable Y. This is known as the law of iterated expectations. 


Independence 
The variables $Y$ and $X$ are said to be independent if
$$f_{X,Y}(x, y) = f_X(x) \cdot f_Y(y).$$ [A.5.11]
Comparing [A.5.11] with [A.5.8], if $Y$ and $X$ are independent, then
$$f_{Y|X}(y|x) = f_Y(y).$$ [A.5.12]


Covariance

Let $\mu_X$ denote $E(X)$ and $\mu_Y$ denote $E(Y)$. The population covariance between
$X$ and $Y$ is given by
$$\text{Cov}(X, Y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x - \mu_X)(y - \mu_Y) \cdot f_{X,Y}(x, y) \, dy \, dx.$$ [A.5.13]


Correlation

The population correlation between $X$ and $Y$ is given by
$$\text{Corr}(X, Y) = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X)} \cdot \sqrt{\text{Var}(Y)}}.$$
If the covariance (or correlation) between $X$ and $Y$ is zero, then $X$ and $Y$ are said
to be uncorrelated.


Relation Between Correlation and Independence 


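Note that if $X$ and $Y$ are independent, then they are uncorrelated:
$$\text{Cov}(X, Y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x - \mu_X)(y - \mu_Y) \cdot f_X(x) \cdot f_Y(y) \, dy \, dx = \int_{-\infty}^{\infty} (x - \mu_X) \cdot \left[\int_{-\infty}^{\infty} (y - \mu_Y) \cdot f_Y(y) \, dy\right] f_X(x) \, dx.$$
Furthermore,
$$\int_{-\infty}^{\infty} (y - \mu_Y) \cdot f_Y(y) \, dy = \int_{-\infty}^{\infty} y \cdot f_Y(y) \, dy - \mu_Y \cdot \int_{-\infty}^{\infty} f_Y(y) \, dy = \mu_Y - \mu_Y = 0.$$
Thus, if $X$ and $Y$ are independent, then $\text{Cov}(X, Y) = 0$, as claimed.

The converse proposition, however, is not true: the fact that $X$ and $Y$ are
uncorrelated is not enough to deduce that they are independent. To construct a
counterexample, suppose that $Z$ and $Y$ are independent random variables each
with mean zero, and let $X = Z \cdot Y$. Then $\mu_X = E(Z) \cdot E(Y) = 0$ and
$$E(X - \mu_X)(Y - \mu_Y) = E[(ZY) \cdot Y] = E(Z) \cdot E(Y^2) = 0,$$
and so $X$ and $Y$ are uncorrelated. They are not, however, independent, since the value
of $ZY$ depends on $Y$.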
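
A concrete version of this counterexample can be enumerated exactly. The distributions below are our own illustrative assumptions, not taken from the text: Z is uniform on {-1, 1} and Y is uniform on {-1, 0, 1}, so both have mean zero and are independent by construction. (Letting Y take the value zero matters; if Y were also restricted to {-1, 1}, the product ZY would happen to be independent of Y.)

```python
from fractions import Fraction
from itertools import product

z_vals = [-1, 1]        # Z uniform on {-1, 1}, mean zero
y_vals = [-1, 0, 1]     # Y uniform on {-1, 0, 1}, mean zero

# Joint distribution of (X, Y) with X = Z * Y, built from independent Z and Y.
joint = {}
for z, y in product(z_vals, y_vals):
    p = Fraction(1, 2) * Fraction(1, 3)
    key = (z * y, y)
    joint[key] = joint.get(key, 0) + p

mean_x = sum(p * x for (x, y), p in joint.items())
mean_y = sum(p * y for (x, y), p in joint.items())
cov = sum(p * (x - mean_x) * (y - mean_y) for (x, y), p in joint.items())

# Independence would require P{X=0, Y=0} = P{X=0} * P{Y=0}.
p_x0 = sum(p for (x, y), p in joint.items() if x == 0)
p_y0 = sum(p for (x, y), p in joint.items() if y == 0)
p_both = joint[(0, 0)]

print(cov, p_both, p_x0 * p_y0)
```

The covariance is exactly zero, yet the joint probability of (X = 0, Y = 0) is 1/3 while the product of the marginal probabilities is 1/9, so the factorization in [A.5.11] fails.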


Orthogonality 
Consider a sample of size $T$ on two random variables, $\{x_1, x_2, \ldots, x_T\}$ and
$\{y_1, y_2, \ldots, y_T\}$. The two variables are said to be orthogonal if
$$\sum_{t=1}^{T} x_t y_t = 0.$$
Thus, orthogonality is the sample analog of absence of correlation.

For example, let $x_t = 1$ denote a sequence of constants and let $y_t = w_t - \bar{w}$,
where $\bar{w} = (1/T)\sum_{t=1}^{T} w_t$ is the sample mean of the variable $w$. Then $x$ and $y$
are orthogonal:
$$\sum_{t=1}^{T} 1 \cdot (w_t - \bar{w}) = \sum_{t=1}^{T} w_t - T\bar{w} = 0.$$
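
The orthogonality of a constant to deviations from the sample mean is easy to confirm on data. In the short sketch below the series `w` is an arbitrary hypothetical sample, and exact rational arithmetic is used so that the sum is exactly zero rather than zero up to rounding.

```python
from fractions import Fraction

# Arbitrary hypothetical observations on w.
w = [Fraction(v) for v in (3, 1, 4, 1, 5, 9, 2, 6)]
T = len(w)
w_bar = sum(w) / T                       # sample mean of w

x = [1] * T                              # constant series x_t = 1
y = [w_t - w_bar for w_t in w]           # deviations from the sample mean

inner = sum(x_t * y_t for x_t, y_t in zip(x, y))
print(inner)
```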


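Population Moments of Sums

Consider the random variable $aX + bY$. Its mean is given by
$$E(aX + bY) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (ax + by) \cdot f_{X,Y}(x, y) \, dy \, dx$$
$$= a\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x \cdot f_{X,Y}(x, y) \, dy \, dx + b\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} y \cdot f_{X,Y}(x, y) \, dy \, dx$$
$$= a\int_{-\infty}^{\infty} x \cdot f_X(x) \, dx + b\int_{-\infty}^{\infty} y \cdot f_Y(y) \, dy,$$
and so
$$E(aX + bY) = a \cdot E(X) + b \cdot E(Y).$$ [A.5.14]
The variance of $(aX + bY)$ is
$$\text{Var}(aX + bY) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} [(ax + by) - (a\mu_X + b\mu_Y)]^2 \cdot f_{X,Y}(x, y) \, dy \, dx$$
$$= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} [(ax - a\mu_X)^2 + 2(ax - a\mu_X)(by - b\mu_Y) + (by - b\mu_Y)^2] \cdot f_{X,Y}(x, y) \, dy \, dx$$
$$= a^2 \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x - \mu_X)^2 \cdot f_{X,Y}(x, y) \, dy \, dx + 2ab \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x - \mu_X)(y - \mu_Y) \cdot f_{X,Y}(x, y) \, dy \, dx + b^2 \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (y - \mu_Y)^2 \cdot f_{X,Y}(x, y) \, dy \, dx.$$
Thus,
$$\text{Var}(aX + bY) = a^2 \cdot \text{Var}(X) + 2ab \cdot \text{Cov}(X, Y) + b^2 \cdot \text{Var}(Y).$$ [A.5.15]
When $X$ and $Y$ are uncorrelated,
$$\text{Var}(aX + bY) = a^2 \cdot \text{Var}(X) + b^2 \cdot \text{Var}(Y).$$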
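
Result [A.5.15] is stated for population moments, but the same algebra holds as an identity for sample moments computed with divisor T. The sketch below checks this on simulated data; the seed, sample size, coefficients, and the way `y` is made to depend on `x` are all hypothetical choices for illustration.

```python
import random

random.seed(0)
T = 1000
x = [random.gauss(0, 1) for _ in range(T)]
y = [0.5 * xt + random.gauss(0, 1) for xt in x]   # correlated with x by construction

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Sample covariance with divisor T; cov(u, u) is the sample variance."""
    mu_u, mu_v = mean(u), mean(v)
    return sum((ui - mu_u) * (vi - mu_v) for ui, vi in zip(u, v)) / len(u)

a, b = 2.0, -3.0
combo = [a * xt + b * yt for xt, yt in zip(x, y)]

lhs = cov(combo, combo)                                            # Var(aX + bY)
rhs = a ** 2 * cov(x, x) + 2 * a * b * cov(x, y) + b ** 2 * cov(y, y)
print(lhs, rhs)
```

The two numbers agree up to floating-point rounding, whatever the degree of correlation between the two series.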


It is straightforward to generalize results [A.5.14] and [A.5.15]. If $\{X_1,
X_2, \ldots, X_n\}$ denotes a collection of $n$ random variables, then
$$E(a_1X_1 + a_2X_2 + \cdots + a_nX_n) = a_1 \cdot E(X_1) + a_2 \cdot E(X_2) + \cdots + a_n \cdot E(X_n)$$ [A.5.16]
$$\text{Var}(a_1X_1 + a_2X_2 + \cdots + a_nX_n) = a_1^2 \cdot \text{Var}(X_1) + a_2^2 \cdot \text{Var}(X_2) + \cdots + a_n^2 \cdot \text{Var}(X_n)$$
$$\quad + 2a_1a_2 \cdot \text{Cov}(X_1, X_2) + 2a_1a_3 \cdot \text{Cov}(X_1, X_3) + \cdots + 2a_1a_n \cdot \text{Cov}(X_1, X_n)$$
$$\quad + 2a_2a_3 \cdot \text{Cov}(X_2, X_3) + 2a_2a_4 \cdot \text{Cov}(X_2, X_4) + \cdots + 2a_{n-1}a_n \cdot \text{Cov}(X_{n-1}, X_n).$$ [A.5.17]
If the $X$'s are uncorrelated, then [A.5.17] simplifies to
$$\text{Var}(a_1X_1 + a_2X_2 + \cdots + a_nX_n) = a_1^2 \cdot \text{Var}(X_1) + a_2^2 \cdot \text{Var}(X_2) + \cdots + a_n^2 \cdot \text{Var}(X_n).$$ [A.5.18]


Cauchy-Schwarz Inequality 


The Cauchy-Schwarz inequality states that for any random variables $X$ and
$Y$ whose variances and covariance exist, the correlation is no greater than unity in
absolute value:
$$-1 \le \text{Corr}(X, Y) \le 1.$$ [A.5.19]
To establish the far right inequality in [A.5.19], consider the random variable
$$Z = \frac{X - \mu_X}{\sqrt{\text{Var}(X)}} - \frac{Y - \mu_Y}{\sqrt{\text{Var}(Y)}}.$$
The square of this variable cannot take on negative values, so
$$E\left[\frac{(X - \mu_X)}{\sqrt{\text{Var}(X)}} - \frac{(Y - \mu_Y)}{\sqrt{\text{Var}(Y)}}\right]^2 \ge 0.$$
Recognizing that $\text{Var}(X)$ and $\text{Var}(Y)$ denote population moments (as opposed to
random variables), equation [A.5.15] can be used to deduce
$$\frac{E(X - \mu_X)^2}{\text{Var}(X)} - 2\,\frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sqrt{\text{Var}(X)}\sqrt{\text{Var}(Y)}} + \frac{E(Y - \mu_Y)^2}{\text{Var}(Y)} \ge 0.$$
Thus,
$$1 - 2 \cdot \text{Corr}(X, Y) + 1 \ge 0,$$
meaning that
$$\text{Corr}(X, Y) \le 1.$$
To establish the far left inequality in [A.5.19], notice that
$$E\left[\frac{(X - \mu_X)}{\sqrt{\text{Var}(X)}} + \frac{(Y - \mu_Y)}{\sqrt{\text{Var}(Y)}}\right]^2 \ge 0,$$
implying that
$$1 + 2 \cdot \text{Corr}(X, Y) + 1 \ge 0,$$
so that
$$\text{Corr}(X, Y) \ge -1.$$


The Normal Distribution 


The variable $Y_t$ has a Gaussian, or Normal, distribution with mean $\mu$ and
variance $\sigma^2$ if
$$f_{Y_t}(y_t) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[\frac{-(y_t - \mu)^2}{2\sigma^2}\right].$$ [A.5.20]
We write
$$Y_t \sim N(\mu, \sigma^2)$$
to indicate that the density of $Y_t$ is given by [A.5.20].

Centered odd-ordered population moments for a Gaussian variable are zero:
$$E(Y_t - \mu)^r = 0 \qquad \text{for } r = 1, 3, 5, \ldots.$$
The centered fourth moment is
$$E(Y_t - \mu)^4 = 3\sigma^4.$$


Skew and Kurtosis 

The skewness of a variable $Y_t$ with mean $\mu$ is represented by
$$\frac{E(Y_t - \mu)^3}{[\text{Var}(Y_t)]^{3/2}}.$$
A variable with a negative skew is more likely to be far below the mean than it is
to be far above the mean. The kurtosis is
$$\frac{E(Y_t - \mu)^4}{[\text{Var}(Y_t)]^2}.$$
A distribution whose kurtosis exceeds 3 has more mass in the tails than a Gaussian
distribution with the same variance.


Other Useful Univariate Distributions 


Let $(X_1, X_2, \ldots, X_n)$ be independent and identically distributed (i.i.d.)
$N(0, 1)$ variables, and consider the sum of their squares:
$$Y = X_1^2 + X_2^2 + \cdots + X_n^2.$$
Then $Y$ is said to have a chi-square distribution with $n$ degrees of freedom, denoted
$$Y \sim \chi^2(n).$$
Let $X \sim N(0, 1)$ and $Y \sim \chi^2(n)$ with $X$ and $Y$ independent. Then
$$Z = \frac{X}{\sqrt{Y/n}}$$
is said to have a $t$ distribution with $n$ degrees of freedom, denoted
$$Z \sim t(n).$$
Let $Y_1 \sim \chi^2(n_1)$ and $Y_2 \sim \chi^2(n_2)$ with $Y_1$ and $Y_2$ independent. Then
$$Z = \frac{Y_1/n_1}{Y_2/n_2}$$
is said to have an $F$ distribution with $n_1$ numerator degrees of freedom and $n_2$
denominator degrees of freedom, denoted
$$Z \sim F(n_1, n_2).$$
Note that if $Z \sim t(n)$, then $Z^2 \sim F(1, n)$.


Likelihood Function 


Suppose we have observed a sample of size $T$ on some random variable $Y_t$.
Let $f_{Y_1, Y_2, \ldots, Y_T}(y_1, y_2, \ldots, y_T; \boldsymbol{\theta})$ denote the joint density of $Y_1, Y_2, \ldots, Y_T$.
The notation emphasizes that this joint density is presumed to depend on a vector
of population parameters $\boldsymbol{\theta}$. If we view this joint density as a function of $\boldsymbol{\theta}$ (given
the data on $Y$), the result is called the sample likelihood function.

For example, consider a sample of $T$ i.i.d. variables drawn from a $N(\mu, \sigma^2)$
distribution. For this distribution, $\boldsymbol{\theta} = (\mu, \sigma^2)'$, and from [A.5.11] the joint density
is the product of individual terms such as [A.5.20]:
$$f_{Y_1, Y_2, \ldots, Y_T}(y_1, y_2, \ldots, y_T; \mu, \sigma^2) = \prod_{t=1}^{T} f_{Y_t}(y_t; \mu, \sigma^2).$$
The log of the joint density is the sum of the logs of these terms:
$$\log f_{Y_1, Y_2, \ldots, Y_T}(y_1, y_2, \ldots, y_T; \mu, \sigma^2) = \sum_{t=1}^{T} \log f_{Y_t}(y_t; \mu, \sigma^2)$$
$$= (-T/2)\log(2\pi) - (T/2)\log(\sigma^2) - \sum_{t=1}^{T} \frac{(y_t - \mu)^2}{2\sigma^2}.$$ [A.5.21]
Thus, for a sample of $T$ Gaussian random variables with mean $\mu$ and variance $\sigma^2$,
the sample log likelihood function, denoted $\mathcal{L}(\mu, \sigma^2; y_1, y_2, \ldots, y_T)$, is given by
$$\mathcal{L}(\mu, \sigma^2; y_1, y_2, \ldots, y_T) = k - (T/2)\log(\sigma^2) - \sum_{t=1}^{T} \frac{(y_t - \mu)^2}{2\sigma^2}.$$ [A.5.22]
In calculating the sample log likelihood function, any constant term that does not
involve the parameter $\mu$ or $\sigma^2$ can be ignored for most purposes. In [A.5.22], this
constant term is
$$k = -(T/2)\log(2\pi).$$


Maximum Likelihood Estimation 


For a given sample of observations $(y_1, y_2, \ldots, y_T)$, the value of $\boldsymbol{\theta}$ that
makes the sample likelihood as large as possible is called the maximum likelihood
estimate (MLE) of $\boldsymbol{\theta}$. For example, the maximum likelihood estimate of the
population mean $\mu$ for an i.i.d. sample of size $T$ from a $N(\mu, \sigma^2)$ distribution is found
by setting the derivative of [A.5.22] with respect to $\mu$ equal to zero:
$$\frac{\partial \mathcal{L}}{\partial \mu} = \sum_{t=1}^{T} \frac{y_t - \mu}{\sigma^2} = 0,$$
or
$$\hat{\mu} = (1/T)\sum_{t=1}^{T} y_t.$$ [A.5.23]
The MLE of $\sigma^2$ is characterized by
$$\frac{\partial \mathcal{L}}{\partial \sigma^2} = -\frac{T}{2\sigma^2} + \sum_{t=1}^{T} \frac{(y_t - \mu)^2}{2\sigma^4} = 0.$$ [A.5.24]
Substituting [A.5.23] into [A.5.24] and solving for $\sigma^2$ gives
$$\hat{\sigma}^2 = (1/T)\sum_{t=1}^{T} (y_t - \hat{\mu})^2.$$ [A.5.25]
Thus, the sample mean is the MLE of the population mean and the sample
variance is the MLE of the population variance for an i.i.d. sample of Gaussian
variables.
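
The closed-form estimates in [A.5.23] and [A.5.25] can be compared against the Gaussian log likelihood evaluated directly. In the sketch below the simulated sample (seed, true mean 2.0, true standard deviation 1.5) and the perturbation sizes are hypothetical choices; the point is only that moving either parameter away from its MLE lowers the log likelihood.

```python
import math
import random

random.seed(1)
y = [random.gauss(2.0, 1.5) for _ in range(500)]   # hypothetical i.i.d. Gaussian sample
T = len(y)

def log_likelihood(mu, sigma2):
    """Gaussian sample log likelihood, including the constant term k."""
    return (-(T / 2) * math.log(2 * math.pi)
            - (T / 2) * math.log(sigma2)
            - sum((yt - mu) ** 2 for yt in y) / (2 * sigma2))

mu_hat = sum(y) / T                                    # sample mean
sigma2_hat = sum((yt - mu_hat) ** 2 for yt in y) / T   # sample variance, divisor T

best = log_likelihood(mu_hat, sigma2_hat)
shifted_mean = log_likelihood(mu_hat + 0.05, sigma2_hat)
inflated_var = log_likelihood(mu_hat, sigma2_hat * 1.1)
print(mu_hat, sigma2_hat, best)
```

Note that the variance estimate divides by T rather than T - 1, as in [A.5.25].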


Multivariate Gaussian Distribution 
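Let
$$\mathbf{Y} = (Y_1, Y_2, \ldots, Y_n)'$$
be a collection of $n$ random variables. The vector $\mathbf{Y}$ has a multivariate Normal, or
multivariate Gaussian, distribution if its density takes the form
$$f_{\mathbf{Y}}(\mathbf{y}) = (2\pi)^{-n/2}|\boldsymbol{\Omega}|^{-1/2} \exp[(-1/2)(\mathbf{y} - \boldsymbol{\mu})'\boldsymbol{\Omega}^{-1}(\mathbf{y} - \boldsymbol{\mu})].$$ [A.5.26]
The mean of $\mathbf{Y}$ is given by the vector $\boldsymbol{\mu}$:
$$E(\mathbf{Y}) = \boldsymbol{\mu};$$
and its variance-covariance matrix is $\boldsymbol{\Omega}$:
$$E(\mathbf{Y} - \boldsymbol{\mu})(\mathbf{Y} - \boldsymbol{\mu})' = \boldsymbol{\Omega}.$$

Note that $(\mathbf{Y} - \boldsymbol{\mu})(\mathbf{Y} - \boldsymbol{\mu})'$ is symmetric and positive semidefinite for any
$\mathbf{Y}$, meaning that any variance-covariance matrix must be symmetric and positive
semidefinite; the form of the likelihood in [A.5.26] assumes that $\boldsymbol{\Omega}$ is positive
definite.

Result [A.4.15] is sometimes used to write the multivariate Gaussian density
in an equivalent form:
$$f_{\mathbf{Y}}(\mathbf{y}) = (2\pi)^{-n/2}|\boldsymbol{\Omega}^{-1}|^{1/2} \exp[(-1/2)(\mathbf{y} - \boldsymbol{\mu})'\boldsymbol{\Omega}^{-1}(\mathbf{y} - \boldsymbol{\mu})].$$

If $\mathbf{Y} \sim N(\boldsymbol{\mu}, \boldsymbol{\Omega})$, then for any nonstochastic ($r \times n$) matrix $\mathbf{H}'$ and ($r \times 1$)
vector $\mathbf{b}$,
$$\mathbf{H}'\mathbf{Y} + \mathbf{b} \sim N((\mathbf{H}'\boldsymbol{\mu} + \mathbf{b}), \mathbf{H}'\boldsymbol{\Omega}\mathbf{H}).$$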


Correlation and Independence for Multivariate 
Gaussian Variates 


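If $\mathbf{Y}$ has a multivariate Gaussian distribution, then absence of correlation
implies independence. To see this, note that if the elements of $\mathbf{Y}$ are uncorrelated,
then $E[(Y_i - \mu_i)(Y_j - \mu_j)] = 0$ for $i \ne j$ and the off-diagonal elements of $\boldsymbol{\Omega}$ are
zero:
$$\boldsymbol{\Omega} = \begin{bmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \sigma_n^2 \end{bmatrix}.$$
For such a diagonal matrix $\boldsymbol{\Omega}$,
$$|\boldsymbol{\Omega}| = \sigma_1^2\sigma_2^2 \cdots \sigma_n^2$$ [A.5.27]
$$\boldsymbol{\Omega}^{-1} = \begin{bmatrix} 1/\sigma_1^2 & 0 & \cdots & 0 \\ 0 & 1/\sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1/\sigma_n^2 \end{bmatrix}.$$ [A.5.28]
Substituting [A.5.27] and [A.5.28] into [A.5.26] produces
$$f_{\mathbf{Y}}(\mathbf{y}) = (2\pi)^{-n/2}[\sigma_1^2\sigma_2^2 \cdots \sigma_n^2]^{-1/2} \times \exp[(-1/2)\{(y_1 - \mu_1)^2/\sigma_1^2 + (y_2 - \mu_2)^2/\sigma_2^2 + \cdots + (y_n - \mu_n)^2/\sigma_n^2\}]$$
$$= \prod_{i=1}^{n} (2\pi)^{-1/2}[\sigma_i^2]^{-1/2} \exp[(-1/2)\{(y_i - \mu_i)^2/\sigma_i^2\}],$$
which is the product of $n$ univariate Gaussian densities. Since the joint density is
the product of the individual densities, the random variables $(Y_1, Y_2, \ldots, Y_n)$
are independent.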


Probability Limit 


Let $\{X_1, X_2, \ldots, X_T\}$ denote a sequence of random variables. Often we are
interested in what happens to this sequence as $T$ becomes large. For example, $X_T$
might denote the sample mean of $T$ observations:
$$X_T = (1/T) \cdot (Y_1 + Y_2 + \cdots + Y_T),$$ [A.5.29]
in which case we might want to know the properties of the sample mean as the
size of the sample $T$ grows large.

The sequence $\{X_1, X_2, \ldots, X_T\}$ is said to converge in probability to $c$ if for
every $\varepsilon > 0$ and $\delta > 0$ there exists a value $N$ such that, for all $T \ge N$,
$$P\{|X_T - c| > \delta\} < \varepsilon.$$ [A.5.30]
When [A.5.30] is satisfied, the number $c$ is called the probability limit, or
plim, of the sequence $\{X_1, X_2, \ldots, X_T\}$. This is sometimes indicated as
$$X_T \xrightarrow{p} c.$$


Law of Large Numbers

Under fairly general conditions detailed in Chapter 7, the sample mean [A.5.29]
converges in probability to the population mean:
$$(1/T) \cdot (Y_1 + Y_2 + \cdots + Y_T) \xrightarrow{p} E(Y_t).$$ [A.5.31]
When [A.5.31] holds, we say that the sample mean gives a consistent estimate of
the population mean.


Convergence in Mean Square

A stronger condition than convergence in probability is mean square
convergence. The sequence $\{X_1, X_2, \ldots, X_T\}$ is said to converge in mean square if for
every $\varepsilon > 0$ there exists a value $N$ such that, for all $T \ge N$,
$$E(X_T - c)^2 < \varepsilon.$$ [A.5.32]
We indicate that the sequence converges to $c$ in mean square as follows:
$$X_T \xrightarrow{m.s.} c.$$
Convergence in mean square implies convergence in probability, but
convergence in probability does not imply convergence in mean square.
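
Convergence in probability of the sample mean can be illustrated by simulation. The sketch below rests on assumptions of our own: the observations are i.i.d. uniform on (0, 1), so the population mean is 0.5, and the tolerance `delta`, the number of replications, and the sample sizes are arbitrary. For each T it estimates the probability in [A.5.30] by the share of simulated samples whose mean misses 0.5 by more than `delta`.

```python
import random

random.seed(42)

def exceed_frequency(T, delta=0.05, reps=500):
    """Share of simulated samples whose mean differs from 0.5 by more than delta."""
    misses = 0
    for _ in range(reps):
        x_bar = sum(random.random() for _ in range(T)) / T
        if abs(x_bar - 0.5) > delta:
            misses += 1
    return misses / reps

freq = {T: exceed_frequency(T) for T in (10, 100, 1000)}
print(freq)
```

The estimated frequency should fall toward zero as T grows, which is what [A.5.30] requires for any fixed tolerance. A simulation of this kind illustrates the definition but of course proves nothing about the limit.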


Appendix A References


Chiang, Alpha C. 1974. Fundamental Methods of Mathematical Economics, 2d ed. New
York: McGraw-Hill.

Hoel, Paul G., Sidney C. Port, and Charles J. Stone. 1971. Introduction to Probability
Theory. Boston: Houghton Mifflin.

Johnston, J. 1984. Econometric Methods, 3d ed. New York: McGraw-Hill.

Lindgren, Bernard W. 1976. Statistical Theory, 3d ed. New York: Macmillan.

Magnus, Jan R., and Heinz Neudecker. 1988. Matrix Differential Calculus with Applications
in Statistics and Econometrics. New York: Wiley.

Marsden, Jerrold E. 1974. Elementary Classical Analysis. San Francisco: Freeman.

O'Nan, Michael. 1976. Linear Algebra, 2d ed. New York: Harcourt Brace Jovanovich.

Strang, Gilbert. 1976. Linear Algebra and Its Applications. New York: Academic Press.

Theil, Henri. 1971. Principles of Econometrics. New York: Wiley.

Thomas, George B., Jr. 1972. Calculus and Analytic Geometry, alternate ed. Reading, Mass.:
Addison-Wesley Publishing Company, Inc.


Statistical Tables 


TABLE B.1
Standard Normal Distribution

[Figure: standard normal density with the area to the right of z_0 shaded; Area = Prob(Z ≥ z_0).]

                    Second decimal place of z_0
z_0     .00     .01     .03     .04     .05     .06     .07     .09
0.0    .5000   .4960   .4880   .4840   .4801   .4761   .4721   .4641
0.1    .4602   .4562   .4483   .4443   .4404   .4364   .4325   .4247
0.2    .4207   .4168   .4090   .4052   .4013   .3974   .3936   .3859
0.3    .3821   .3783   .3707   .3669   .3632   .3594   .3557   .3483
0.4    .3446   .3409   .3336   .3300   .3264   .3228   .3192   .3121

0.5    .3085   .3050   .2981   .2946   .2912   .2877   .2843   .2776
0.6    .2743   .2709   .2643   .2611   .2578   .2546   .2514   .2451
0.7    .2420   .2389   .2327   .2296   .2266   .2236   .2206   .2148
0.8    .2119   .2090   .2033   .2005   .1977   .1949   .1922   .1867
0.9    .1841   .1814   .1762   .1736   .1711   .1685   .1660   .1611

1.0    .1587   .1562   .1515   .1492   .1469   .1446   .1423   .1379
1.1    .1357   .1335   .1292   .1271   .1251   .1230   .1210   .1170
1.2    .1151   .1131   .1093   .1075   .1056   .1038   .1020   .0985
1.3    .0968   .0951   .0918   .0901   .0885   .0869   .0853   .0823
1.4    .0808   .0793   .0764   .0749   .0735   .0722   .0708   .0681

            Second decimal place of z_0
z_0     .00     .01     .03     .06     .07
1.5    .0668   .0655   .0630   .0594   .0582
1.6    .0548   .0537   .0516   .0485   .0475
1.7    .0446   .0436   .0418   .0392   .0384
1.8    .0359   .0352   .0336   .0314   .0307
1.9    .0287   .0281   .0268   .0250   .0244

2.0    .0228   .0222   .0212   .0197   .0192
2.1    .0179   .0174   .0166   .0154   .0150
2.2    .0139   .0136   .0129   .0119   .0116
2.3    .0107   .0104   .0099   .0091   .0089
2.4    .0082   .0080   .0075   .0069   .0068

2.5    .0062   .0060   .0057   .0052   .0051
2.6    .0047   .0045   .0043   .0039   .0038
2.7    .0035   .0034   .0032   .0029   .0028
2.8    .0026   .0025   .0023   .0021   .0021
2.9    .0019   .0018   .0017   .0015   .0015

3.0    .00135
3.5    .000 233
4.0    .000 0317
4.5    .000 003 40
5.0    .000 000 287

Table entries give the probability that a N(0, 1) variable takes on a value greater than or equal to z_0.
For example, if Z ~ N(0, 1), the probability that Z > 1.96 = 0.0250. By symmetry, the table entries
could also be interpreted as the probability that a N(0, 1) variable takes a value less than or equal to
−z_0.

Source: Thomas H. Wonnacott and Ronald J. Wonnacott, Introductory Statistics, 2d ed., p. 480.
Copyright © 1972 by John Wiley & Sons, Inc., New York. Reprinted by permission of John Wiley &
Sons, Inc.
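
Entries of Table B.1 can be reproduced numerically from the complementary error function. The sketch below is one way to do so using only Python's standard library; the function name `upper_tail` is our own, and the relation Prob(Z ≥ z_0) = (1/2)·erfc(z_0/√2) is a standard identity for the N(0, 1) distribution rather than something derived in the text.

```python
import math

def upper_tail(z0):
    """Prob(Z >= z0) for Z ~ N(0, 1), via the complementary error function."""
    return 0.5 * math.erfc(z0 / math.sqrt(2))

# A few values to compare with the printed table.
for z0 in (0.0, 1.0, 1.96, 3.0):
    print(z0, round(upper_tail(z0), 5))
```

By the symmetry noted beneath the table, `upper_tail(z0)` also gives the probability that Z is less than or equal to minus z_0.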


TABLE B.2 
The x? Distribution 


Degrees of 
freedom 
(m) 0.995 
1 4x 10-5 
2 0.010 
3 0.072 
4 0.207 
5 0.412 
6 0.676 
7 0.989 
8 1.34 
9 1.73 
10 2.16 
11 2.60 
12 3.07 
13 3.57 
14 4.07 
15 4.60 
16 5.14 
17 5.70 
18 6.26 
19 6.84 
20 7,43 
21 8.03 
22 8.64 
23 9.26 
24 9.89 
25 10.5 
26 11.2 
27 11.8 
28 12.5 
29 13.1 
30 13.8 
40 20.7 
50 28.0 
60 35.5 
70 43.3 
80 51.2 
90 59.2 
100 67.3 


(continued on next page) 


Probability that $\chi^2(m)$ is greater than entry


0.990 


$2 \times 10^{-4}$
0.020 
0.115 
0.297 


0.554 
0.872 
1.24 
1.65 
2.09 


2.56 
3.05 
3.57 
4.11 
4.66 


5.23 
5.81 
6.41 
7.01 
7.63 


15.0 


AnD wy 
S2aG ASB 


0.975 


0.001 
0.051 
0.216 
0.484 


0.831 
1.24 


2B 


RP RRP Pe 
G8 OR SSO, 00.00% 
iy PISA 2) 


0.950 


0.004 
0.103 
0.352 
0.711 


1.15 
1.64 
2.17 
2.73 
3.33 


3.94 
4.57 
5.23 
5.89 
6.57 


7.26 
7.96 
8.67 
9.39 
10.1 


10.9 
11.6 
12.3 
13.1 
13.8 


14.6 
15.4 
16.2 
16.9 
17.7 


18.5 
26.5 
34.8 
43.2 


31.7 
60.4 
69.1
77.9 


0.900 


0.016 
0.211 
0.584 
1.06 


1.61 
2.20 
2.83 
3.49 
4.17. 


4.87 
5.58 
6.30 
7.04 
7.79 


8.55 

9.31 
10.1 
10.9 
11.7 


12.4 
13.2 
14.0 
14.8 
15.7 


16.5 
17.3 
18.1 
18.9 
19.8 


20.6 
29.1 
37.7 
46.5 


55.3 
64.3 
73.3 
82.4 


0.750 


0.102 
0.575 
1.21 
1.92 


2.67 
3.45 
4.25 
5.07 
5.90 


6.74 
7.58 
8.44 
9.30 
10.2 


11.0 
11.9 
12.8 
13.7 
14.6 
15.5 
16.3 
17.2 
18.1 
19.0 


19.9 
20.8 
21.7 
22.7 
23.6 


24.5 
33.7 
42.9 
52.3


61.7 
71.1 
80.6 
90.1 


0.500 


0.455 
1.39 
2.37 
3.36 


4.35 
5.35 
6.35 


8.34 
9.34 


~ 
Ww 
- 


WWW HWW BWW BHUWWH BHBWe BUWw 


wwowowo wocwrs 




TABLE B.2 (continued) 


Degrees of 
freedom 

(m) 0.250 

1 1.32 

2 2.77 

3 4.11 

4 5.39

5 6.63 

6 7.84 

7 9.04 
8 10.2 
9 11.4 
10 12.5 
11 13.7 
12 14.8 
13 16.0 
14 17.1 
15 18.2 
16 19.4 
17 20.5 
18 21.6 
19 22.7
20 23.8 
21 24.9 
22 26.0 
23 27.1 
24 28.2 
25 29.3 
26 30.4 
27 31.5 
28 32.6 
29 33.7 
30 34.8
40 45.6 
50 56.3 
60 67.0 
70 77.6 
80 88.1 
90 98.6 

100 109 


Probability that $\chi^2(m)$ is greater than entry


0.100 


2.71 
4.61 
6.25 
7.78 


9.24 
10.6 
12.0 
13.4 
14.7 


16.0 
17.3 
18.5 
19.8 
21.1


22.3 
23.5 
24.8 
26.0 
27.2 


28.4 
29.6 
30.8 
32.0 
33.2 


34.4 
35.6 
36.7 
37.9 
39.1 


40.3 
51.8 
63.2 
74.4 


85.5 

96.6 
108 
118 


0.050 


3.84 
5.99 
7.81 
9.49 


11.1 
12.6 
14.1 
15.5 
16.9 


18.3 
19.7 
21.0 
22.4 
23.7 


25.0 
26.3 
27.6 
28.9 
30.1 


31.4 
32.7 
33.9 
35.2 
36.4 


37.7 
38.9 
40.1 
41.3 
42.6 


43.8 
55.8 
67.5 
79.1 


90.5 
102 
113 
124 


0.025 


5.02 

7.38 

9.35 
11.1 


12.8 
14.4 
16.0 
17.5 
19.0 


20.5 
21.9 
23.3 
24.7 
26.1 


27.5
28.8 
30.2 
31.5 
32.9 


34.2
35.5 
36.8 
38.1 
39.4 


40.6 
41.9 
43.2 
44.5 
45.7 


47.0 
59.3 
71.4 
83.3 


95.0 
107 
118 
130 


0.010 


6.63 

9.21 
11.3 
13.3 


15.1 
16.8 
18.5 
20.1 
21.7 


23.2 
24.7


100 
112 
124 
136 


0.005 


7.88 
10.6 
12.8 
14.9 


16.7 
18.5 
20.3 
22.0 
23.6 


25.2 
26.8 
28.3 


104 
116 
128 
140 


0.001 


10.8 
13.8 
16.3 
18.5 


20.5 
22.5 
24.3
26.1 
27.9 


29.6 
31.3 
32.9 
34.5 
36.1 


37.7 
39.3 
40.8 
42.3 
43.8 


45.3 
46.8 
48.3 
49.7 
51.2


52.6 
54.1
55.5 
56.9 
58.3 


59.7 
73.4 
86.7 
99.6 


112 
125 
137 
149 


The probability shown at the head of the column is the area in the right-hand tail. For example, there
is a 10% probability that a $\chi^2$ variable with 2 degrees of freedom would be greater than 4.61.

Source: Adapted from Henri Theil, Principles of Econometrics, pp. 718-19. Copyright © 1971 by John
Wiley & Sons, Inc., New York. Also Thomas H. Wonnacott and Ronald J. Wonnacott, Introductory
Statistics, 2d ed., p. 482. Copyright © 1972 by John Wiley & Sons, Inc., New York. Reprinted by
permission of John Wiley & Sons, Inc.
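
For an even number of degrees of freedom the right-hand tail area of a $\chi^2$ variable has a closed form, which gives a quick check on entries of Table B.2. The sketch below assumes the standard result P($\chi^2(2k)$ > x) = exp(-x/2) times the sum of (x/2)^i / i! for i = 0, ..., k - 1; the function name is illustrative and odd degrees of freedom are deliberately not handled.

```python
import math


def chi2_upper_tail_even(x: float, m: int) -> float:
    """Right-tail probability P(chi2(m) > x) for even, positive m (closed form)."""
    if m <= 0 or m % 2 != 0:
        raise ValueError("this sketch only covers even, positive degrees of freedom")
    half = x / 2.0
    term = 1.0   # (x/2)^i / i!, starting at i = 0
    total = 0.0
    for i in range(m // 2):
        if i > 0:
            term *= half / i
        total += term
    return math.exp(-half) * total


# The example in the table note: 2 degrees of freedom, entry 4.61.
print(round(chi2_upper_tail_even(4.61, 2), 3))
```

With m = 2 the sum has a single term, so the tail area reduces to exp(-x/2).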




TABLE B.3 
The t Distribution

Degrees of
freedom            Probability that t(m) is greater than entry
(m)      0.25    0.10    0.05    0.025    0.010    0.005    0.001
  1     1.000   3.078   6.314   12.706   31.821   63.657   318.31
  2      .816   1.886   2.920    4.303    6.965    9.925   22.326
  3      .765   1.638   2.353    3.182    4.541    5.841   10.213
  4      .741   1.533   2.132    2.776    3.747    4.604    7.173
  5      .727   1.476   2.015    2.571    3.365    4.032    5.893
  6      .718   1.440   1.943    2.447    3.143    3.707    5.208
  7      .711   1.415   1.895    2.365    2.998    3.499    4.785
  8      .706   1.397   1.860    2.306    2.896    3.355    4.501
  9      .703   1.383   1.833    2.262    2.821    3.250    4.297
 10      .700   1.372   1.812    2.228    2.764    3.169    4.144
 11      .697   1.363   1.796    2.201    2.718    3.106    4.025
 12      .695   1.356   1.782    2.179    2.681    3.055    3.930
 13      .694   1.350   1.771    2.160    2.650    3.012    3.852
 14      .692   1.345   1.761    2.145    2.624    2.977    3.787
 15      .691   1.341   1.753    2.131    2.602    2.947    3.733
 16      .690   1.337   1.746    2.120    2.583    2.921    3.686
 17      .689   1.333   1.740    2.110    2.567    2.898    3.646
 18      .688   1.330   1.734    2.101    2.552    2.878    3.610
 19      .688   1.328   1.729    2.093    2.539    2.861    3.579
 20      .687   1.325   1.725    2.086    2.528    2.845    3.552
 21      .686   1.323   1.721    2.080    2.518    2.831    3.527
 22      .686   1.321   1.717    2.074    2.508    2.819    3.505
 23      .685   1.319   1.714    2.069    2.500    2.807    3.485
 24      .685   1.318   1.711    2.064    2.492    2.797    3.467
 25      .684   1.316   1.708    2.060    2.485    2.787    3.450
 26      .684   1.315   1.706    2.056    2.479    2.779    3.435
 27      .684   1.314   1.703    2.052    2.473    2.771    3.421
 28      .683   1.313   1.701    2.048    2.467    2.763    3.408
 29      .683   1.311   1.699    2.045    2.462    2.756    3.396
 30      .683   1.310   1.697    2.042    2.457    2.750    3.385
 40      .681   1.303   1.684    2.021    2.423    2.704    3.307
 60      .679   1.296   1.671    2.000    2.390    2.660    3.232
120      .677   1.289   1.658    1.980    2.358    2.617    3.160
  ∞      .674   1.282   1.645    1.960    2.326    2.576    3.090


The probability shown at the head of the column is the area in the right-hand tail. For example, there
is a 10% probability that a t variable with 20 degrees of freedom would be greater than 1.325. By
symmetry, there is also a 10% probability that a t variable with 20 degrees of freedom would be less
than -1.325.

Source: Thomas H. Wonnacott and Ronald J. Wonnacott, Introductory Statistics, 2d ed., p. 481.
Copyright © 1972 by John Wiley & Sons, Inc., New York. Reprinted by permission of John Wiley &
Sons, Inc.




TABLE B.4 


The F Distribution 
Denominator
degrees of
freedom                        Numerator degrees of freedom ($m_1$)
($m_2$)      1      2      3      4      5      6      7      8      9     10
  1        161    200    216    225    230    234    237    239    241    242
          4052   4999   5403   5625   5764   5859   5928   5981   6022   6056
  2      18.51  19.00  19.16  19.25  19.30  19.33  19.36  19.37  19.38  19.39
         98.49  99.00  99.17  99.25  99.30  99.33  99.34  99.36  99.38  99.40
  3      10.13   9.55   9.28   9.12   9.01   8.94   8.88   8.84   8.81   8.78
         34.12  30.82  29.46  28.71  28.24  27.91  27.67  27.49  27.34  27.23
  4       7.71   6.94   6.59   6.39   6.26   6.16   6.09   6.04   6.00   5.96
         21.20  18.00  16.69  15.98  15.52  15.21  14.98  14.80  14.66  14.54
  5       6.61   5.79   5.41   5.19   5.05   4.95   4.88   4.82   4.78   4.74
         16.26  13.27  12.06  11.39  10.97  10.67  10.45  10.27  10.15  10.05
  6       5.99   5.14   4.76   4.53   4.39   4.28   4.21   4.15   4.10   4.06
         13.74  10.92   9.78   9.15   8.75   8.47   8.26   8.10   7.98   7.87
  7       5.59   4.74   4.35   4.12   3.97   3.87   3.79   3.73   3.68   3.63
         12.25   9.55   8.45   7.85   7.46   7.19   7.00   6.84   6.71   6.62
  8       5.32   4.46   4.07   3.84   3.69   3.58   3.50   3.44   3.39   3.34
         11.26   8.65   7.59   7.01   6.63   6.37   6.19   6.03   5.91   5.82
  9       5.12   4.26   3.86   3.63   3.48   3.37   3.29   3.23   3.18   3.13
         10.56   8.02   6.99   6.42   6.06   5.80   5.62   5.47   5.35   5.26
 10       4.96   4.10   3.71   3.48   3.33   3.22   3.14   3.07   3.02   2.97
         10.04   7.56   6.55   5.99   5.64   5.39   5.21   5.06   4.95   4.85
 11       4.84   3.98   3.59   3.36   3.20   3.09   3.01   2.95   2.90   2.86
          9.65   7.20   6.22   5.67   5.32   5.07   4.88   4.74   4.63   4.54
 12       4.75   3.88   3.49   3.26   3.11   3.00   2.92   2.85   2.80   2.76
          9.33   6.93   5.95   5.41   5.06   4.82   4.65   4.50   4.39   4.30
 13       4.67   3.80   3.41   3.18   3.02   2.92   2.84   2.77   2.72   2.67
          9.07   6.70   5.74   5.20   4.86   4.62   4.44   4.30   4.19   4.10
 14       4.60   3.74   3.34   3.11   2.96   2.85   2.77   2.70   2.65   2.60
          8.86   6.51   5.56   5.03   4.69   4.46   4.28   4.14   4.03   3.94
 15       4.54   3.68   3.29   3.06   2.90   2.79   2.70   2.64   2.59   2.55
          8.68   6.36   5.42   4.89   4.56   4.32   4.14   4.00   3.89   3.80
 16       4.49   3.63   3.24   3.01   2.85   2.74   2.66   2.59   2.54   2.49
          8.53   6.23   5.29   4.77   4.44   4.20   4.03   3.89   3.78   3.69
 17       4.45   3.59   3.20   2.96   2.81   2.70   2.62   2.55   2.50   2.45
          8.40   6.11   5.18   4.67   4.34   4.10   3.93   3.79   3.68   3.59
 18       4.41   3.55   3.16   2.93   2.77   2.66   2.58   2.51   2.46   2.41
          8.28   6.01   5.09   4.58   4.25   4.01   3.85   3.71   3.60   3.51
 19       4.38   3.52   3.13   2.90   2.74   2.63   2.55   2.48   2.43   2.38
          8.18   5.93   5.01   4.50   4.17   3.94   3.77   3.63   3.52   3.43


(continued on page 758) 




11


243 
6082 
19.40 
99.41 
8.76 
27,13 
5.93 
14.45 


4.70 
9.96 
4.03 
7.79 
3.60 
6.54 
3.31 
5.74 
3.10 
5.18 


2.94 
4.78 
2.82 
4.46 
2.72 
4.22 
2.63 
4.02 
2.56 
3.86 


2.51 
3.73 
2.45 
3.61 
2.41 
3.52 
2.37 
3.44 
2.34 
3.36 


12 
244 
6106 
19.41 
99.42 
8.74 
27.05 
5.91 
14.37 


4.68 
9.89 
4.00 
7.72 
3.57 
6.47 
3.28 
5.67 
3.07 
5.11 


2.91 
4.71 
2.79 
4.40 
2.69 
4.16 
2.60 
3.96 
2.53 
3.80 


2.48 
3.67 
2.42 
3.55 
2.38 
3.45 
2.34 
3.37 
2.31 
3.30 


14 
245 
6142 
19.42 
99.43 
8.71 
26.92 


5.87 
14.24 


4.64 
9.77 
3.96 
7.60 
3.52 
6.35 
3.23 
5.56 
3.02 
5.00 


2.86 
4.60 
2.74 
4.29 
2.64 
4.05 
2.55 
3.85 
2.48 
3.70 


2.43 
3.56 
2.37 
3.45 
2.33 
3.35 
2.29 
3.27 
2.26 
3.19 


16 


246 
6169 
19.43 
99.44 
8.69 
26.83 
5.84 
14.15 


20 
248 
6203 
19.44 
99.45 
8.66 
26.69 


5.80 
14.02 


4.56 
9.55 


3.87 
7.39 
3.44 
6.15 
3.15 
5.36 
2.93 
4.80 


2.77 
441 
2.65 
4.10 
2.54 
3.86 
2.46 


40 50 75 
251 252-253 
6286 6302 6323 
19.47 19.47 19.48 
99.48 99.48 99.49 
8.60 8.58 8.57 
26.41 26.35 26.27 
5.71 5.70 5.68 
13.74 13.69 13.61 


446 4.44 4.42 
9.29 9.24 9.17 
3.77 3.75 3.72 
7.14 7.09 7.02 
3.34 3.32 3.29 
5.90 5,85 5.78 
3.05 3.03 3.00 
§.11 5.06 5.00 
2.82 2.80 2.77 
4.56 4.51 4.45 


2.67 2.64 2.61 
4.17 4.12 4.05 
2.53 2.50 2.47 
3.86 3.80 3.74 
2.42 2.40 2.36 
3.61 3.56 3.49 
2.34 2.32 2.28 
3.42 3.37 3.30 
2.27. 2.24 2.21 
3.26 3.21 3.14 


2.21 2.18 2.15 
3.12 3.07 3.00 
2.16 2.13 2.09 
3.01 2.96 2.89 
2.11 2.08 2.04 
2.92 2.86 2.79 
2.07 2.04 2.00 
2.83 2.78 2.71 
2.02 2.00 1.96 
2.76 2.70 2.63 


100 
253 
6334 
19.49 
99.49 
8.56 
26.23 
5.66 
13.57 


4.40 
9.13 
3.71 
6.99 
3.28 
5.75 
2.98 
4.96 
2.76 
4.41 


2.59 
4.01 
2.45 
3.70 
2.35 
3.46 
2.26 
3.27 


200 
254 
6352 
19.49 
99.49 
8.54 
26.18 


500 
254 


wo 


254 


6361 6366 
19.50 19.50 
99.50 99.50 


8.54 


8.53 


26.14 26.12 


5.64 


5.63 


13.48 13.46 


4.37 
9.04 
3.68 
6,90 
3.24 
5.67 
2.94 
4.88 
2.72 
4.33 


2.55 
3.93 
2.41 
3.62 
2.31 
3.38 
2.22 
3.18 
2.14 
3.02 


2.08 
2.89 
2.02 
2.77 
1.97 
2.67 
1.93 
2.59 
1.90 
2.51 


4.36 
9.02 
3.67 
6.88 
3.23 
5.65 
2.93 
4.86 
2.71 
4.31 


2.54 
3.91 




TABLE B.4 (continued) 


Denominator 
degrees of 
freedom 
(m) 


20 


2 


32 


40 


42 


7.12 


(continued on page 760) 


\ 


Numerator degrees of freedom (m,) 


4 


2.87 
4.43 
2.84 
4.37 
2.82 
431 
2.80 
4,26 
2.78 
4.22 


2.76 
4.18 
2.74 
4.14 
2.73 
411 
2.71 
4.07 
2.70 
4.04 


2.69 
4.02 
2.67 
3.97 
2.65 
3.93 
2.63 
3.89 
2.62 
3.86 


2.61 
3.83 
2.59 
3.80 
2.58 
3.78 
2.57 
3.76 
2.56 
3.74 


2.56 
3.72 
2.54 
3.68 




5 


2.71 
4.10 
2,68 
4.04 
2.66 
3.99 
2.64 
3.94 
2.62 
3.90 


6 


7 


2.52 
3.71 


2.49 
3.65 


2.47 
3.59 


2.45 


8 


2.45 
3.56 
2.42 
3.51 
2.40 
3.45 
2.38 
3.41 
2.36 
3.36 


2,34 
3.32 
2.32 
3.29 


2.37 


2.35 
3.35 


2.32 


2.30 


3.21 
2.27 
3.17 


3.14 
2.24 
3.1 
2.22 
3.08 


2.21 
3.06 
2.19 
3.01 
2.17 
2.97 
2.15 
2.94 
2.14 
2.91 


2.12 
2.88 
2.11 
2.86 
2.10 
2.84 
2.09 
2.82 
2.08 
2.80 
2.07 


2.78 


2.05 
2.75 


1.98 
2.62 


2.58 


1.95 
2.56 
1.93 
2.53 




Denominator
degrees of
freedom                        Numerator degrees of freedom ($m_1$)
($m_2$)      1      2      3      4      5      6      7      8      9     10
  60      4.00   3.15   2.76   2.52   2.37   2.25   2.17   2.10   2.04   1.99
          7.08   4.98   4.13   3.65   3.34   3.12   2.95   2.82   2.72   2.63
  65      3.99   3.14   2.75   2.51   2.36   2.24   2.15   2.08   2.02   1.98
          7.04   4.95   4.10   3.62   3.31   3.09   2.93   2.79   2.70   2.61
  70      3.98   3.13   2.74   2.50   2.35   2.23   2.14   2.07   2.01   1.97
          7.01   4.92   4.08   3.60   3.29   3.07   2.91   2.77   2.67   2.59
  80      3.96   3.11   2.72   2.48   2.33   2.21   2.12   2.05   1.99   1.95
          6.96   4.88   4.04   3.56   3.25   3.04   2.87   2.74   2.64   2.55
 100      3.94   3.09   2.70   2.46   2.30   2.19   2.10   2.03   1.97   1.92
          6.90   4.82   3.98   3.51   3.20   2.99   2.82   2.69   2.59   2.51
 125      3.92   3.07   2.68   2.44   2.29   2.17   2.08   2.01   1.95   1.90
          6.84   4.78   3.94   3.47   3.17   2.95   2.79   2.65   2.56   2.47
 150      3.91   3.06   2.67   2.43   2.27   2.16   2.07   2.00   1.94   1.89
          6.81   4.75   3.91   3.44   3.14   2.92   2.76   2.62   2.53   2.44
 200      3.89   3.04   2.65   2.41   2.26   2.14   2.05   1.98   1.92   1.87
          6.76   4.71   3.88   3.41   3.11   2.90   2.73   2.60   2.50   2.41
 400      3.86   3.02   2.62   2.39   2.23   2.12   2.03   1.96   1.90   1.85
          6.70   4.66   3.83   3.36   3.06   2.85   2.69   2.55   2.46   2.37
1000      3.85   3.00   2.61   2.38   2.22   2.10   2.02   1.95   1.89   1.84
          6.66   4.62   3.80   3.34   3.04   2.82   2.66   2.53   2.43   2.34
   ∞      3.84   2.99   2.60   2.37   2.21   2.09   2.01   1.94   1.88   1.83
          6.64   4.60   3.78   3.32   3.02   2.80   2.64   2.51   2.41   2.32


The table describes the distribution of an F variable with $m_1$ numerator and $m_2$ denominator degrees
of freedom. Entries in the standard typeface give the 5% critical value, and boldface entries give the
1% critical value for the distribution. For example, there is a 5% probability that an F variable with 2
numerator and 50 denominator degrees of freedom would exceed 3.18; there is only a 1% probability
that it would exceed 5.06.


Source: George W. Snedecor and William G. Cochran, Statistical Methods, 8th ed. Copyright 1989 by 
Iowa State University Press. Reprinted by permission of Iowa State University Press. 
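
The example in the note to Table B.4 can be checked directly, because an F variable with 2 numerator degrees of freedom has a closed-form tail area. The sketch below assumes the standard result P(F(2, N) > f) = (1 + 2f/N)^(-N/2), which follows from writing the numerator as an exponential variable; the function name is illustrative and the formula does not extend to other numerator degrees of freedom.

```python
def f2_upper_tail(f: float, n_denominator: float) -> float:
    """Right-tail probability P(F(2, N) > f) using the closed form for m1 = 2."""
    return (1.0 + 2.0 * f / n_denominator) ** (-n_denominator / 2.0)


# 2 numerator and 50 denominator degrees of freedom, as in the table note.
print(round(f2_upper_tail(3.18, 50), 3))
print(round(f2_upper_tail(5.06, 50), 3))
```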




1.65 
2.03 


2.00 
1.62 
1.98 


1,94 
1.57 
1.89 


1.55 
1.85 
1.54 
1.83 
1.52 
1.79 
1.49 
1.74 
1.47 
1.71 


1.46 
1.69 




TABLE B.5
Critical Values for the Phillips-Perron $Z_\rho$ Test and for the Dickey-Fuller Test
Based on Estimated OLS Autoregressive Coefficient

Sample
size            Probability that $T(\hat{\rho} - 1)$ is less than entry
T        0.01   0.025    0.05    0.10    0.90    0.95   0.975    0.99

Case 1
 25     -11.9    -9.3    -7.3    -5.3    1.01    1.40    1.79
 50     -12.9    -9.9    -7.7    -5.5    0.97    1.35    1.70
100     -13.3   -10.2    -7.9    -5.6    0.95    1.31    1.65
250     -13.6   -10.3    -8.0    -5.7    0.93    1.28    1.62
500     -13.7   -10.4    -8.0    -5.7    0.93    1.28    1.61
  ∞     -13.8   -10.5    -8.1    -5.7    0.93    1.28    1.60

Case 2
 25     -17.2   -14.6   -12.5   -10.2   -0.76    0.01    0.65
 50     -18.9   -15.7   -13.3   -10.7   -0.81   -0.07    0.53
100     -19.8   -16.3   -13.7   -11.0   -0.83   -0.10    0.47
250     -20.3   -16.6   -14.0   -11.2   -0.84   -0.12    0.43
500     -20.5   -16.8   -14.0   -11.2   -0.84   -0.13    0.42
  ∞     -20.7   -16.9   -14.1   -11.3   -0.85   -0.13    0.41

Case 4
 25     -22.5   -19.9   -17.9   -15.6   -3.66   -2.51   -1.53
 50     -25.7   -22.4   -19.8   -16.8   -3.71   -2.60   -1.66
100     -27.4   -23.6   -20.7   -17.5   -3.74   -2.62   -1.73
250     -28.4   -24.4   -21.3   -18.0   -3.75   -2.64   -1.78
500     -28.9   -24.8   -21.5   -18.1   -3.76   -2.65   -1.78
  ∞     -29.5   -25.1   -21.8   -18.3   -3.77   -2.66   -1.79   -0.87

The probability shown at the head of the column is the area in the left-hand tail.


Source: Wayne A. Fuller, Introduction to Statistical Time Series, Wiley, New York, 1976, p. 371. 




TABLE B.6 
Critical Values for the Phillips-Perron $Z_t$ Test and for the Dickey-Fuller Test
Based on Estimated OLS t Statistic

Sample
size        Probability that $(\hat{\rho} - 1)/\hat{\sigma}_{\hat{\rho}}$ is less than entry
T        0.01   0.025    0.05    0.10    0.90    0.95   0.975    0.99

Case 1
 25     -2.66   -2.26   -1.95   -1.60    0.92    1.33    1.70    2.16
 50     -2.62   -2.25   -1.95   -1.61    0.91    1.31    1.66    2.08
100     -2.60   -2.24   -1.95   -1.61    0.90    1.29    1.64    2.03
250     -2.58   -2.23   -1.95   -1.62    0.89    1.29    1.63    2.01
500     -2.58   -2.23   -1.95   -1.62    0.89    1.28    1.62    2.00
  ∞     -2.58   -2.23   -1.95   -1.62    0.89    1.28    1.62    2.00

Case 2
 25     -3.75   -3.33   -3.00   -2.63   -0.37    0.00    0.34    0.72
 50     -3.58   -3.22   -2.93   -2.60   -0.40   -0.03    0.29    0.66
100     -3.51   -3.17   -2.89   -2.58   -0.42   -0.05    0.26    0.63
250     -3.46   -3.14   -2.88   -2.57   -0.42   -0.06    0.24    0.62
500     -3.44   -3.13   -2.87   -2.57   -0.43   -0.07    0.24    0.61
  ∞     -3.43   -3.12   -2.86   -2.57   -0.44   -0.07    0.23    0.60

Case 4
 25     -4.38   -3.95   -3.60   -3.24   -1.14   -0.80   -0.50   -0.15
 50     -4.15   -3.80   -3.50   -3.18   -1.19   -0.87   -0.58   -0.24
100     -4.04   -3.73   -3.45   -3.15   -1.22   -0.90   -0.62   -0.28
250     -3.99   -3.69   -3.43   -3.13   -1.23   -0.92   -0.64   -0.31
500     -3.98   -3.68   -3.42   -3.13   -1.24   -0.93   -0.65   -0.32
  ∞     -3.96   -3.66   -3.41   -3.12   -1.25   -0.94   -0.66   -0.33

The probability shown at the head of the column is the area in the left-hand tail.


Source: Wayne A. Fuller, Introduction to Statistical Time Series, Wiley, New York, 1976, p. 373. 




TABLE B.7 
Critical Values for the Dickey-Fuller Test Based on the OLS F Statistic 


Sample
size            Probability that F test is greater than entry
T        0.99   0.975    0.95    0.90    0.10    0.05   0.025    0.01

Case 2
(F test of $\alpha = 0$, $\rho = 1$ in regression $y_t = \alpha + \rho y_{t-1} + u_t$)
 25      0.29    0.38    0.49    0.65    4.12    5.18    6.30    7.88
 50      0.29    0.39    0.50    0.66    3.94    4.86    5.80    7.06
100      0.29    0.39    0.50    0.67    3.86    4.71    5.57    6.70
250      0.30    0.39    0.51    0.67    3.81    4.63    5.45    6.52
500      0.30    0.39    0.51    0.67    3.79    4.61    5.41    6.47
  ∞      0.30    0.40    0.51    0.67    3.78    4.59    5.38    6.43

Case 4
(F test of $\delta = 0$, $\rho = 1$ in regression $y_t = \alpha + \delta t + \rho y_{t-1} + u_t$)
 25      0.74    0.90    1.08    1.33    5.91    7.24    8.65   10.61
 50      0.76    0.93    1.11    1.37    5.61    6.73    7.81    9.31
100      0.76    0.94    1.12    1.38    5.47    6.49    7.44    8.73
250      0.76    0.94    1.13    1.39    5.39    6.34    7.25    8.43
500      0.76    0.94    1.13    1.39    5.36    6.30    7.20    8.34
  ∞      0.77    0.94    1.13    1.39    5.34    6.25    7.16    8.27

The probability shown at the head of the column is the area in the right-hand tail.


Source: David A. Dickey and Wayne A. Fuller, “Likelihood Ratio Statistics for Autoregressive Time 
Series with a Unit Root,” Econometrica 49 (1981), p. 1063. 




TABLE B.8 
Critical Values for the Phillips $Z_\rho$ Statistic When Applied to Residuals
from Spurious Cointegrating Regression

Number of right-hand variables in regression, excluding trend or constant: (n - 1)
Sample size: (T)

                   Probability that $(T - 1)(\hat{\rho} - 1)$ is less than entry
(n - 1)   (T)    0.010   0.025   0.050   0.075   0.100   0.125   0.150

Case 1
   1      500    -22.8   -18.9   -15.6   -13.8   -12.5   -11.6   -10.7
   2      500    -29.3   -25.2   -21.5   -19.6   -18.2   -17.0   -16.0
   3      500    -36.2   -31.5   -27.9   -25.5   -23.9   -22.6   -21.5
   4      500    -42.9   -37.5   -33.5   -30.9   -28.9   -27.4   -26.2
   5      500    -48.5   -42.5   -38.1   -35.5   -33.8   -32.3   -30.9

Case 2
   1      500    -28.3   -23.8   -20.5   -18.5   -17.0   -15.9   -14.9
   2      500    -34.2   -29.7   -26.1   -23.9   -22.2   -21.0   -19.9
   3      500    -41.1   -35.7   -32.1   -29.5   -27.6   -26.2   -25.1
   4      500    -47.5   -41.6   -37.2   -34.7   -32.7   -31.2   -29.9
   5      500    -52.2   -46.5   -41.9   -39.1   -37.0   -35.5   -34.2

Case 3
   1      500    -28.9   -24.8   -21.5      --   -18.1      --      --
   2      500    -35.4   -30.8   -27.1   -24.8   -23.2   -21.8   -20.8
   3      500    -40.3   -36.1   -32.2   -29.7   -27.8   -26.5   -25.3
   4      500    -47.4   -42.6   -37.7   -35.0   -33.2   -31.7   -30.3
   5      500    -53.6   -47.1   -42.5   -39.7   -37.7   -36.0   -34.6


The probability shown at the head of the column is the area in the left-hand tail. 


Source: P. C. B. Phillips and S. Ouliaris, “Asymptotic Properties of Residual Based Tests for
Cointegration,” Econometrica 58 (1990), pp. 189-90. Also Wayne A. Fuller, Introduction to Statistical Time
Series, Wiley, New York, 1976, p. 371.




TABLE B.9 


Critical Values for the Phillips $Z_t$ Statistic or the Dickey-Fuller t Statistic When
Applied to Residuals from Spurious Cointegrating Regression

Number of right-hand variables in regression, excluding trend or constant: (n - 1)
Sample size: (T)

             Probability that $(\hat{\rho} - 1)/\hat{\sigma}_{\hat{\rho}}$ is less than entry
(n - 1)   (T)    0.010   0.025   0.050   0.075   0.100   0.125   0.150

Case 1
   1      500    -3.39   -3.05   -2.76   -2.58   -2.45   -2.35   -2.26
   2      500    -3.84   -3.55   -3.27   -3.11   -2.99   -2.88   -2.79
   3      500    -4.30   -3.99   -3.74   -3.57   -3.44   -3.35   -3.26
   4      500    -4.67   -4.38   -4.13   -3.95   -3.81   -3.71   -3.61
   5      500    -4.99   -4.67   -4.40   -4.25   -4.14   -4.04   -3.94

Case 2
   1      500    -3.96   -3.64   -3.37   -3.20   -3.07   -2.96   -2.86
   2      500    -4.31   -4.02   -3.77   -3.58   -3.45   -3.35   -3.26
   3      500    -4.73   -4.37   -4.11   -3.96   -3.83   -3.73   -3.65
   4      500    -5.07   -4.71   -4.45   -4.29   -4.16   -4.05   -3.96
   5      500    -5.28   -4.98   -4.71   -4.56   -4.43   -4.33   -4.24

Case 3
   1      500    -3.98   -3.68   -3.42
   2      500    -4.36   -4.07   -3.80   -3.65   -3.52   -3.42   -3.33
   3      500    -4.65   -4.39   -4.16   -3.98   -3.84   -3.74   -3.66
   4      500    -5.04   -4.77   -4.49   -4.32   -4.20   -4.08   -4.00
   5      500    -5.36   -5.02   -4.74   -4.58   -4.46   -4.36   -4.28


The probability shown at the head of the column is the area in the left-hand tail. 


Source: P. C. B. Phillips and S. Ouliaris, “Asymptotic Properties of Residual Based Tests for
Cointegration,” Econometrica 58 (1990), p. 190. Also Wayne A. Fuller, Introduction to Statistical Time
Series, Wiley, New York, 1976, p. 373.




TABLE B.10 
Critical Values for Johansen’s Likelihood Ratio Test of the Null Hypothesis
of h Cointegrating Relations Against the Alternative of No Restrictions

Number of
random     Sample
walks      size      Probability that $2(\mathcal{L}_A - \mathcal{L}_0)$ is greater than entry
(g)        (T)      0.500    0.200    0.100    0.050    0.025    0.001

Case 1
 1         400       0.58     1.82     2.86     3.84     4.93     6.51
 2         400       5.42     8.45    10.47    12.53    14.43    16.31
 3         400      14.30    18.83    21.63    24.31    26.64    29.75
 4         400      27.10    33.16    36.58    39.89    42.30    45.58
 5         400      43.79    51.13    55.44    59.46    62.91    66.52

Case 2
 1         400      2.415    4.905    6.691    8.083    9.658   11.576
 2         400      9.335   13.038   15.583   17.844   19.611   21.962
 3         400     20.188   25.445   28.436   31.256   34.062   37.291
 4         400     34.873   41.623   45.248   48.419   51.801   55.551
 5         400     53.373   61.566   65.956   69.977   73.031   77.911

Case 3
 1         400      0.447    1.699    2.816    3.962    5.332    6.936
 2         400      7.638   11.164   13.338   15.197   17.299   19.310
 3         400     18.759   23.868   26.791   29.509   32.313   35.397
 4         400     33.672   40.250   43.964   47.181   50.424   53.792
 5         400     52.588   60.215   65.063   68.905   72.140   76.955

The probability shown at the head of the column is the area in the right-hand tail. The number of
random walks under the null hypothesis (g) is given by the number of variables described by the
vector autoregression (n) minus the number of cointegrating relations under the null hypothesis (h).
In each case the alternative is that g = 0.


Source: Michael Osterwald-Lenum, “A Note with Quantiles of the Asymptotic Distribution of the
Maximum Likelihood Cointegration Rank Test Statistics,” Oxford Bulletin of Economics and
Statistics 54 (1992), p. 462; and Søren Johansen and Katarina Juselius, “Maximum Likelihood
Estimation and Inference on Cointegration—with Applications to the Demand for Money,” Oxford
Bulletin of Economics and Statistics 52 (1990), p. 208.




TABLE B.11 
Critical Values for Johansen’s Likelihood Ratio Test of the Null Hypothesis
of h Cointegrating Relations Against the Alternative of h + 1 Relations

Number of
random     Sample
walks      size      Probability that $2(\mathcal{L}_A - \mathcal{L}_0)$ is greater than entry
(g)        (T)      0.500    0.200    0.100    0.050    0.025    0.001

Case 1
 1         400       0.58     1.82     2.86     3.84     4.93     6.51
 2         400       4.83     7.58     9.52    11.44    13.27    15.69
 3         400       9.71    13.31    15.59    17.89    20.02    22.99
 4         400      14.94    18.97    21.58    23.80    26.14    28.82
 5         400      20.16    24.83    27.62    30.04    32.51    35.17

Case 2
 1         400      2.415    4.905    6.691    8.083    9.658   11.576
 2         400      7.474   10.666   12.783   14.595   16.403   18.782
 3         400     12.707   16.521   18.959   21.279   23.362   26.154
 4         400     17.875   22.341   24.917   27.341   29.599   32.616
 5         400     23.132   27.953   30.818   33.262   35.700   38.858

Case 3
 1         400      0.447    1.699    2.816    3.962    5.332    6.936
 2         400      6.852   10.125   12.099   14.036   15.810   17.936
 3         400     12.381   16.324   18.697   20.778   23.002   25.521
 4         400     17.719   22.113   24.712   27.169   29.335   31.943
 5         400     23.211   27.899   30.774   33.178   35.546   38.341

The probability shown at the head of the column is the area in the right-hand tail. The number of
random walks under the null hypothesis (g) is given by the number of variables described by the
vector autoregression (n) minus the number of cointegrating relations under the null hypothesis (h).
In each case the alternative is that there are h + 1 cointegrating relations.


Source: Michael Osterwald-Lenum, “A Note with Quantiles of the Asymptotic Distribution of the
Maximum Likelihood Cointegration Rank Test Statistics,” Oxford Bulletin of Economics and
Statistics 54 (1992), p. 462; and Søren Johansen and Katarina Juselius, “Maximum Likelihood
Estimation and Inference on Cointegration—with Applications to the Demand for Money,” Oxford
Bulletin of Economics and Statistics 52 (1990), p. 208.




C 


Answers 
to Selected Exercises 


Chapter 3. Stationary ARMA Processes 


3.1. Yes, any MA process is covariance-stationary. Autocovariances:

$\gamma_0 = 7.4$
$\gamma_{\pm 1} = 4.32$
$\gamma_{\pm 2} = 0.8$
$\gamma_j = 0$ for $|j| > 2$.
3.2. Yes, the process is covariance-stationary, since
$(1 - 1.1z + 0.18z^2) = (1 - 0.9z)(1 - 0.2z)$;
the eigenvalues (0.9 and 0.2) are both inside the unit circle. The autocovariances are as
follows:

$\gamma_0 = 7.89$
$\gamma_1 = 7.35$
$\gamma_j = 1.1\gamma_{j-1} - 0.18\gamma_{j-2}$ for $j = 2, 3, \ldots$
$\gamma_{-j} = \gamma_j$.
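
The figures in answer 3.2 can be reproduced from the standard AR(2) autocovariance formulas. This is a minimal sketch that assumes a unit innovation variance (the exercise statement is not reproduced here, so that value is an assumption chosen because it matches the printed answer); the function name is illustrative.

```python
def ar2_autocovariances(phi1, phi2, sigma2=1.0, max_lag=5):
    """Autocovariances gamma_0..gamma_max_lag of a stationary AR(2) process."""
    # First autocorrelation from the Yule-Walker equations.
    rho1 = phi1 / (1.0 - phi2)
    # Variance of a stationary AR(2) process.
    gamma0 = (1.0 - phi2) * sigma2 / (
        (1.0 + phi2) * ((1.0 - phi2) ** 2 - phi1 ** 2)
    )
    gammas = [gamma0, rho1 * gamma0]
    # Higher lags follow the same difference equation as the process itself.
    for j in range(2, max_lag + 1):
        gammas.append(phi1 * gammas[j - 1] + phi2 * gammas[j - 2])
    return gammas


# Coefficients read off from (1 - 1.1z + 0.18z^2); sigma2 = 1 is an assumption.
gammas = ar2_autocovariances(1.1, -0.18)
print([round(g, 2) for g in gammas[:3]])
```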


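3.3. Equating coefficients on:
$L^0$ gives $\psi_0 = 1$
$L^1$ gives $-\phi_1\psi_0 + \psi_1 = 0$
$L^2$ gives $-\phi_2\psi_0 - \phi_1\psi_1 + \psi_2 = 0$
$\vdots$
$L^j$ gives $-\phi_p\psi_{j-p} - \phi_{p-1}\psi_{j-p+1} - \cdots - \phi_1\psi_{j-1} + \psi_j = 0$,
for $j = p, p + 1, \ldots$.
These imply
$\psi_0 = 1$
$\psi_1 = \phi_1$
$\psi_2 = \phi_1^2 + \phi_2$
$\vdots$
$\psi_j = \phi_1\psi_{j-1} + \phi_2\psi_{j-2} + \cdots + \phi_p\psi_{j-p}$, for $j = p, p + 1, \ldots$.
Thus the values of $\psi_j$ are the solution to a pth-order difference equation with starting values
$\psi_0 = 1$ and $\psi_{-1} = \psi_{-2} = \cdots = \psi_{-p+1} = 0$. Thus, from the results on difference equations,
$$\begin{bmatrix} \psi_j \\ \psi_{j-1} \\ \vdots \\ \psi_{j-p+1} \end{bmatrix} = F^j \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix};$$
that is, $\psi_j = f_{11}^{(j)}$, the (1, 1) element of $F^j$.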


3.4. From [2.1.6],
$\psi(L)c = (\psi_0 + \psi_1 + \psi_2 + \psi_3 + \cdots)\cdot c$.
But the sum $(\psi_0 + \psi_1 + \psi_2 + \psi_3 + \cdots)$ can be viewed as the polynomial $\psi(z)$ evaluated
at z = 1:
$\psi(L)c = \psi(1)\cdot c$.
Moreover, from [3.4.19],
$\psi(1) = 1/(1 - \phi_1 - \phi_2)$.
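3.5. Let $\lambda_1$ and $\lambda_2$ satisfy $(1 - \phi_1 z - \phi_2 z^2) = (1 - \lambda_1 z)(1 - \lambda_2 z)$, noting that $\lambda_1$ and $\lambda_2$
are both inside the unit circle for a covariance-stationary AR(2) process.
Consider first the case where $\lambda_1$ and $\lambda_2$ are real and distinct. Then from [1.2.29],
$$\sum_{j=0}^{\infty} |\psi_j| = \sum_{j=0}^{\infty} |c_1\lambda_1^j + c_2\lambda_2^j|$$
$$\le \sum_{j=0}^{\infty} |c_1\lambda_1^j| + \sum_{j=0}^{\infty} |c_2\lambda_2^j|$$
$$= |c_1|/(1 - |\lambda_1|) + |c_2|/(1 - |\lambda_2|)$$
$$< \infty.$$
Consider next the case where $\lambda_1$ and $\lambda_2$ are distinct complex conjugates. Let $R = |\lambda_i|$
denote the modulus of $\lambda_1$ or $\lambda_2$. Then $0 \le R < 1$, and from [1.2.39],
$$\sum_{j=0}^{\infty} |\psi_j| = \sum_{j=0}^{\infty} |c_1\lambda_1^j + c_2\lambda_2^j|$$
$$= \sum_{j=0}^{\infty} |2\alpha R^j \cos(\theta j) - 2\beta R^j \sin(\theta j)|$$
$$\le |2\alpha| \sum_{j=0}^{\infty} R^j |\cos(\theta j)| + |2\beta| \sum_{j=0}^{\infty} R^j |\sin(\theta j)|$$
$$\le |2\alpha| \sum_{j=0}^{\infty} R^j + |2\beta| \sum_{j=0}^{\infty} R^j$$
$$= 2(|\alpha| + |\beta|)/(1 - R)$$
$$< \infty.$$
Finally, for the case of a repeated real root $|\lambda| < 1$,
$$\sum_{j=0}^{\infty} |\psi_j| = \sum_{j=0}^{\infty} |k_1\lambda^j + k_2 j\lambda^{j-1}| \le |k_1| \sum_{j=0}^{\infty} |\lambda|^j + |k_2| \sum_{j=0}^{\infty} |j\lambda^{j-1}|.$$
But
$$|k_1| \sum_{j=0}^{\infty} |\lambda|^j = |k_1|/(1 - |\lambda|) < \infty$$
and
$$\sum_{j=0}^{\infty} |j\lambda^{j-1}| = 1 + 2|\lambda| + 3|\lambda|^2 + 4|\lambda|^3 + \cdots$$
$$= 1 + (|\lambda| + |\lambda|) + (|\lambda|^2 + |\lambda|^2 + |\lambda|^2) + (|\lambda|^3 + |\lambda|^3 + |\lambda|^3 + |\lambda|^3) + \cdots$$
$$= (1 + |\lambda| + |\lambda|^2 + |\lambda|^3 + \cdots) + (|\lambda| + |\lambda|^2 + |\lambda|^3 + \cdots) + (|\lambda|^2 + |\lambda|^3 + \cdots) + \cdots$$
$$= 1/(1 - |\lambda|) + |\lambda|/(1 - |\lambda|) + |\lambda|^2/(1 - |\lambda|) + \cdots$$
$$= 1/(1 - |\lambda|)^2$$
$$< \infty.$$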


3.8. $(1 + 2.4z + 0.8z^2) = (1 + 0.4z)(1 + 2z)$.
The invertible operator is
$(1 + 0.4z)(1 + 0.5z) = (1 + 0.9z + 0.2z^2)$,
so the invertible representation is
$Y_t = (1 + 0.9L + 0.2L^2)\varepsilon_t$
$E(\varepsilon_t^2) = 4$.
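
One way to confirm answer 3.8 is to check that the noninvertible and invertible representations imply the same autocovariances. The sketch below assumes the original MA(2) process has unit innovation variance, an assumption that is consistent with the autocovariances 7.4, 4.32, and 0.8 reported in answer 3.1; the function name is illustrative.

```python
def ma_autocovariances(theta, sigma2):
    """Autocovariances gamma_0..gamma_q of Y_t = (1 + theta_1 L + ... + theta_q L^q) e_t."""
    coeffs = [1.0] + list(theta)
    q = len(theta)
    return [
        sigma2 * sum(coeffs[i] * coeffs[i + j] for i in range(q + 1 - j))
        for j in range(q + 1)
    ]


# Noninvertible form (1 + 2.4L + 0.8L^2) with variance 1 (assumed).
original = ma_autocovariances([2.4, 0.8], 1.0)
# Invertible form (1 + 0.9L + 0.2L^2) with variance 4, as in the answer.
invertible = ma_autocovariances([0.9, 0.2], 4.0)

print([round(g, 2) for g in original])
print([round(g, 2) for g in invertible])
```

Flipping the root 2 to its reciprocal 0.5 scales the innovation variance by the square of the flipped root, which is where the value 4 comes from.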


Chapter 4. Forecasting 


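4.3. $$\begin{bmatrix} 1 & 0 & 0 \\ -2 & 1 & 0 \\ 3 & 1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & -2 & 3 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{bmatrix}$$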
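
The three factors in answer 4.3 have the form A D A', with A lower triangular with 1s on its diagonal and D diagonal. The sketch below multiplies them out and then recovers them again with a textbook triangular factorization. The matrix being factored is not shown in this section, so it is obtained here as the product of the printed factors; the function names are illustrative.

```python
def triangular_factorization(omega):
    """Return (A, d) with omega = A diag(d) A', A unit lower triangular."""
    n = len(omega)
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    d = [0.0] * n
    for j in range(n):
        # Diagonal element after removing the contribution of earlier columns.
        d[j] = omega[j][j] - sum(a[j][k] ** 2 * d[k] for k in range(j))
        for i in range(j + 1, n):
            partial = sum(a[i][k] * a[j][k] * d[k] for k in range(j))
            a[i][j] = (omega[i][j] - partial) / d[j]
    return a, d


def rebuild(a, d):
    """Multiply A diag(d) A' back together."""
    n = len(d)
    return [
        [sum(a[i][k] * d[k] * a[j][k] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


# Factors as printed in the answer.
A = [[1.0, 0.0, 0.0], [-2.0, 1.0, 0.0], [3.0, 1.0, 1.0]]
D = [1.0, 2.0, 1.0]

omega = rebuild(A, D)
A_hat, D_hat = triangular_factorization(omega)
print(omega)
print(A_hat, D_hat)
```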


4.4. No. The projection of $Y_4$ on $Y_3$, $Y_2$, and $Y_1$ can be calculated from
$\hat{P}(Y_4|Y_3, Y_2, Y_1) = a_{41}Y_1 + a_{42}[Y_2 - \hat{P}(Y_2|Y_1)] + a_{43}[Y_3 - \hat{P}(Y_3|Y_2, Y_1)]$.
The projection $\hat{P}(Y_3|Y_2, Y_1)$, in turn, is given by
$\hat{P}(Y_3|Y_2, Y_1) = a_{31}Y_1 + a_{32}[Y_2 - \hat{P}(Y_2|Y_1)]$.
The coefficient on $Y_2$ in $\hat{P}(Y_4|Y_3, Y_2, Y_1)$ is therefore given by $a_{42} - a_{43}a_{32}$.


Chapter 5. Maximum Likelihood Estimation 
5.2. The negative of the matrix of second derivatives is 


0 
H(®) = [3 At 
so that [5.7.12] implies 


~-GB TE 


Chapter 7. Asymptotic Distribution Theory 


7.1. By continuity, $|g(X_T, c_T) - g(\xi, c)| > \delta$ only if $|X_T - \xi| + |c_T - c| > \eta$ for some
$\eta$. But $c_T \to c$ and $X_T \xrightarrow{p} \xi$ means that we can find an N such that $|c_T - c| < \eta/2$ for all
$T \ge N$ and such that $P\{|X_T - \xi| > \eta/2\} < \varepsilon$ for all $T \ge N$. Hence $P\{|X_T - \xi| +
|c_T - c| > \eta\}$ is less than $\varepsilon$ for all $T \ge N$, implying that $P\{|g(X_T, c_T) - g(\xi, c)| > \delta\} < \varepsilon$.


7.2. (a) For an AR(1) process, $\psi(z) = 1/(1 - \phi z)$ and $g_Y(z) = \sigma^2/[(1 - \phi z)(1 - \phi z^{-1})]$,
with
$$g_Y(1) = \frac{\sigma^2}{(1 - \phi)(1 - \phi)} = \frac{1}{(1 - 0.8)^2} = 25.$$
Thus $\lim_{T \to \infty} T\cdot\mathrm{Var}(\bar{Y}_T) = 25$.
(b) T = 10,000 ($\sqrt{25/10{,}000} = 0.05$).

7.3. No, the variance can be a function of time.


7.4. Yes, $\varepsilon_t$ has variance $\sigma^2$ for all t. Since $\varepsilon_t$ is a martingale difference sequence, it has
mean zero and must be serially uncorrelated. Thus $\{\varepsilon_t\}$ is white noise and this is a
covariance-stationary MA($\infty$) process.


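7.7. From the results of Chapter 3, $Y_t$ can be written as $Y_t = \mu + \sum_{j=0}^{\infty} \psi_j\varepsilon_{t-j}$ with
$\sum_{j=0}^{\infty} |\psi_j| < \infty$. Then (a) follows immediately from Proposition 7.5 and result [3.3.19]. For
(b), notice that $E|\varepsilon_t|^r < \infty$ for r = 4, so that result [7.2.14] establishes that
$$[1/(T - k)] \sum_{t=k+1}^{T} \tilde{Y}_t\tilde{Y}_{t-k} \xrightarrow{p} E(\tilde{Y}_t\tilde{Y}_{t-k}),$$
where $\tilde{Y}_t = Y_t - \mu$. But
$$[1/(T - k)] \sum_{t=k+1}^{T} Y_tY_{t-k} = [1/(T - k)] \sum_{t=k+1}^{T} (\tilde{Y}_t + \mu)(\tilde{Y}_{t-k} + \mu)$$
$$= [1/(T - k)] \sum_{t=k+1}^{T} \tilde{Y}_t\tilde{Y}_{t-k} + \mu[1/(T - k)] \sum_{t=k+1}^{T} \tilde{Y}_t + \mu[1/(T - k)] \sum_{t=k+1}^{T} \tilde{Y}_{t-k} + \mu^2$$
$$\xrightarrow{p} E(\tilde{Y}_t\tilde{Y}_{t-k}) + 0 + 0 + \mu^2$$
$$= E(\tilde{Y}_t + \mu)(\tilde{Y}_{t-k} + \mu)$$
$$= E(Y_tY_{t-k}).$$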


Chapter 8. Linear Regression Models 
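8.1. $$R_u^2 = \frac{y'X(X'X)^{-1}X'y}{y'y}$$
$$= \frac{y'y - y'[I_T - X(X'X)^{-1}X']y}{y'y}$$
$$= 1 - [(y'M_XM_Xy)/(y'y)]$$
$$= 1 - [(\hat{u}'\hat{u})/(y'y)].$$
$$R_c^2 = \frac{y'y - y'M_Xy - T\bar{y}^2}{y'y - T\bar{y}^2}$$
$$= 1 - [(\hat{u}'\hat{u})/(y'y - T\bar{y}^2)]$$
and
$$y'y - T\bar{y}^2 = \sum_{t=1}^{T} y_t^2 - T\bar{y}^2 = \sum_{t=1}^{T} (y_t - \bar{y})^2.$$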


8.2. The 5% critical value for a $\chi^2(2)$ variable is 5.99. An F(2, N) variable will thus have
a critical value that approaches 5.99/2 = 3.00 as $N \to \infty$. One needs N of around 300
observations before the critical value of an F(2, N) variable reaches 3.03, or within 1% of
the limiting value.
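
The convergence described in answer 8.2 can be traced numerically. The sketch below inverts the closed-form tail area of an F(2, N) variable, P(F > f) = (1 + 2f/N)^(-N/2), to obtain critical values; the closed form is a standard result for 2 numerator degrees of freedom, and the function names and the search loop are illustrative choices.

```python
import math


def f2_critical_value(alpha: float, n_denominator: int) -> float:
    """Critical value c with P(F(2, N) > c) = alpha, from the closed-form tail."""
    half = n_denominator / 2.0
    return half * (alpha ** (-1.0 / half) - 1.0)


def chi2_2_critical_value(alpha: float) -> float:
    """Critical value c with P(chi2(2) > c) = alpha, since the tail is exp(-c/2)."""
    return -2.0 * math.log(alpha)


def smallest_n_within(tolerance: float, alpha: float = 0.05) -> int:
    """Smallest N whose F(2, N) critical value is within `tolerance` of the limit."""
    limit = chi2_2_critical_value(alpha) / 2.0
    n = 1
    while f2_critical_value(alpha, n) > (1.0 + tolerance) * limit:
        n += 1
    return n


print(round(chi2_2_critical_value(0.05), 2))
print(round(f2_critical_value(0.05, 300), 2))
print(smallest_n_within(0.01))
```

The search is a plain linear scan, which is adequate here because the critical value falls monotonically in N.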


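8.3. Fourth moments of $x_tu_t$ are of the form $E(\varepsilon_t^4)\cdot E(y_{t-i}y_{t-j}y_{t-l}y_{t-m})$. The first term is
bounded under Assumption 8.4, and the second term is bounded as in Example 7.14.
Moreover, a typical element of $(1/T) \sum_{t=1}^{T} u_t^2x_tx_t'$ is of the form
$$(1/T) \sum_{t=1}^{T} \varepsilon_t^2y_{t-i}y_{t-j} = (1/T) \sum_{t=1}^{T} (\varepsilon_t^2 - \sigma^2)y_{t-i}y_{t-j} + \sigma^2\cdot(1/T) \sum_{t=1}^{T} y_{t-i}y_{t-j}$$
$$\xrightarrow{p} 0 + \sigma^2\cdot E(y_{t-i}y_{t-j}).$$
Hence, the conditions of Proposition 7.9 are satisfied.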
8.4. Proposition 7.5 and result [7.2.14] establish 


-1 


e 1 (VT)3y,-1 +++ (V/T)3y,_, (/T)2y, 
A 
bir] — | G/T)Ey-1 (UT )By71 + VT) By -1¥.-2 (V/T)Zy,-1¥ 
Oper (UT)Sy,_p CU/T)Ey-p¥e-1 °° + (UT)Sy2., (1T)3Y,-pY, 
1 ph ee m ee ph 
’, # eH eg Wie ee a 
Hoa te ts Yt Yp + 


which equals a? given in [4.3.6]. 




Chapter 10. Covariance-Stationary Vector Processes 


(1 + @)e2 h,@02 


We @) Poe h, os {hi(1 + 62)02 + a2} 


0 
Bi Pres ¢ + ae | 


r h, a | 
r_4,= r.,=f, 

a fork = +3, +4,.... 

(b) sy(w) = ony[ ‘| 
Sq, S22 

Sy, = (1 + @)o2 + @a2e- + Boze 

Siz = hyOo2e + (1 + O?)o2e + 4,607 

Sq, = h\Oore-* + h,(1 + O?)o2e-™ + h,602 

So = AF(1 + O)o? + 02 + h?Oa2e-” + hi@a2e 

Cyx(w) = (277) 'h,o2{@-cos(2w) + (1 + 67)-cos(w) + O} 


Qrx(w) = —(2m)“"h,oAG-sin(2w) + (1 + 7)-sin(w)}. 
(c) The variable X, follows an MA(1) process, for which the spectrum is indeed s,,. 


The term s,, is s,, times h(e~”) = h,-e~™, Multiplying s2, in turn by h(e”) = h,-e™ and 
adding a? produces 5s... 


(d) en f oe dw = em {" hy -e~#eiwk dw, 
When k = 1, this is simply 


em {7 hese = he, 
as desired. When k # 1, the integral is 
en |" hye dep 


= (27)7! {e h,:cos{(kK — 1)a] dw + i-(27)-! ie hy sin{(k — lo] dw 


= [kK — I2a}-'h, 


sin[(k — all — [(k - 12a] “eos - al 


w= — 77 a 7 


= 0. 


Chapter 11. Vector Autoregressions 
11.1. A typical element of [11.A.2] states that 


T 
(1/T) > Enis EhsVigt—h ae E (Eine) EV je-qVae-t) 
But 


T T r 
(v/T) »» EqaYiys-yEhsVat—k, = (VT) >» Z,+ E(€,1£,2)°(Q/T) > Viyt- tJ at-bs 
where 
2,= {6,484 = E (Enea ase Vana he 


Notice that z, is a martingale difference sequence whose variance is finite by virtue of 




Proposition 7.10, Hence, (1/T) 7. ,z,—> 0. Moreover, 
T 
(UT) D Ype-Wne-a > EVs Vne-4)> 


by virtue of Proposition 10.2(d). 
11.2. (a) No. (b) Yes. (c) No.


11.3. a =G forj=1,2,...,p 
B= 1 forj=1,2,...,p 
Ay = 9,07) 
Ay = ¥ — A DG'a, forj=1,2,...,p 
& = 8 — 2,07 '8, forj=1,2,...,p 
oF =O 
oF = Oy — AyDF'Q2 
uy, = Ey 


U2, = Ex, — NDF", 
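11.4. Premultiplying by $\mathbf{A}^*(L)$ results in
$$\begin{bmatrix} |\mathbf{A}(L)| & 0 \\ 0 & |\mathbf{A}(L)| \end{bmatrix}\begin{bmatrix} y_{1t} \\ y_{2t} \end{bmatrix} = \begin{bmatrix} 1 - \xi(L) & \eta(L) \\ \lambda_0 + \lambda(L) & 1 - \zeta(L) \end{bmatrix}\begin{bmatrix} u_{1t} \\ u_{2t} \end{bmatrix}$$
$$= \begin{bmatrix} [1 - \xi(L)]u_{1t} + \eta(L)u_{2t} \\ [\lambda_0 + \lambda(L)]u_{1t} + [1 - \zeta(L)]u_{2t} \end{bmatrix} \equiv \begin{bmatrix} v_{1t} \\ v_{2t} \end{bmatrix}.$$
Thus,
$$|\mathbf{A}(L)|\,y_{1t} = v_{1t}$$
$$|\mathbf{A}(L)|\,y_{2t} = v_{2t}.$$
Now the determinant $|\mathbf{A}(L)|$ is the following polynomial in the lag operator:
$$|\mathbf{A}(L)| = [1 - \zeta(L)][1 - \xi(L)] - [\eta(L)][\lambda_0 + \lambda(L)].$$
The coefficient on $L^0$ in this polynomial is unity, and the highest power of $L$ is $L^{2p}$, which has coefficient $(\zeta_p\xi_p - \eta_p\lambda_p)$:
$$|\mathbf{A}(L)| = 1 + a_1L + a_2L^2 + \cdots + a_{2p}L^{2p}.$$
Furthermore, $v_{1t}$ is the sum of two mutually uncorrelated MA($p$) processes, and so $v_{1t}$ is itself MA($p$). Hence, $y_{1t}$ follows an ARMA($2p$, $p$) process; a similar argument shows that $y_{2t}$ follows an ARMA($2p$, $p$) process with the same autoregressive coefficients but different moving average coefficients.

In general, consider an $n$-variable VAR of the form
$$\boldsymbol{\Phi}(L)\mathbf{y}_t = \boldsymbol{\varepsilon}_t$$
with
$$E(\boldsymbol{\varepsilon}_t\boldsymbol{\varepsilon}_\tau') = \begin{cases}\boldsymbol{\Omega} & \text{if } t = \tau \\ \mathbf{0} & \text{otherwise.}\end{cases}$$
Find the triangular factorization of $\boldsymbol{\Omega} = \mathbf{A}\mathbf{D}\mathbf{A}'$ and premultiply the system by $\mathbf{A}^{-1}$, yielding
$$\mathbf{A}(L)\mathbf{y}_t = \mathbf{u}_t$$
where
$$\mathbf{A}(L) = \mathbf{A}^{-1}\boldsymbol{\Phi}(L)$$
$$\mathbf{u}_t = \mathbf{A}^{-1}\boldsymbol{\varepsilon}_t$$
$$E(\mathbf{u}_t\mathbf{u}_t') = \mathbf{D}.$$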


774 Appendix C | Answers to Selected Exercises 


Thus, the elements of $\mathbf{u}_t$ are mutually uncorrelated and $\mathbf{A}(0)$ has 1s along its principal diagonal. The adjoint matrix $\mathbf{A}^*(L)$ has the property
$$\mathbf{A}^*(L)\cdot\mathbf{A}(L) = |\mathbf{A}(L)|\cdot\mathbf{I}_n.$$
Premultiplying the system by $\mathbf{A}^*(L)$,
$$|\mathbf{A}(L)|\cdot\mathbf{y}_t = \mathbf{A}^*(L)\mathbf{u}_t.$$
The determinant $|\mathbf{A}(L)|$ is a scalar polynomial containing terms up to order $L^{np}$, while elements of $\mathbf{A}^*(L)$ contain terms up to order $L^{(n-1)p}$. Hence, the $i$th row of the system takes the form
$$|\mathbf{A}(L)|\cdot y_{it} = v_{it},$$
where $v_{it}$ is the sum of $n$ mutually uncorrelated MA[$(n-1)p$] processes and is therefore itself MA[$(n-1)p$]. Hence, $y_{it} \sim$ ARMA[$np$, $(n-1)p$].


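11.5. (a) $$|\mathbf{I}_2 - \boldsymbol{\Phi}_1z| = (1 - 0.3z)(1 - 0.4z) - (0.8z)(0.9z) = 1 - 0.7z - 0.6z^2 = (1 - 1.2z)(1 + 0.5z).$$
Since $z^* = 1/1.2$ is inside the unit circle, the system is nonstationary.

(b) $$\boldsymbol{\Psi}_0 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \qquad \boldsymbol{\Psi}_1 = \begin{bmatrix} 0.3 & 0.8 \\ 0.9 & 0.4 \end{bmatrix} \qquad \boldsymbol{\Psi}_2 = \begin{bmatrix} 0.81 & 0.56 \\ 0.63 & 0.88 \end{bmatrix};$$
$\boldsymbol{\Psi}_s$ diverges as $s \to \infty$.

(c) $$y_{1,t+2} - \hat{E}(y_{1,t+2}\,|\,\mathbf{y}_t, \mathbf{y}_{t-1}, \ldots) = \varepsilon_{1,t+2} + 0.3\varepsilon_{1,t+1} + 0.8\varepsilon_{2,t+1}$$
$$\text{MSE} = 1 + (0.3)^2 + (0.8)^2(2) = 2.37.$$
The fraction due to $\varepsilon_1$ = 1.09/2.37 = 0.46.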
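The arithmetic in answer 11.5 can be reproduced with a few lines of NumPy. This is a sketch under two assumptions: the second row of $\boldsymbol{\Phi}_1$ is taken to be (0.9, 0.4), as implied by the determinant in (a), and the innovation variance matrix is taken to be diag(1, 2), as implied by the MSE arithmetic in (c).

```python
import numpy as np

# Assumed coefficient matrix, read off from the determinant in part (a).
Phi1 = np.array([[0.3, 0.8],
                 [0.9, 0.4]])
# Assumed innovation variances, inferred from the MSE arithmetic in part (c).
Omega = np.diag([1.0, 2.0])

# (a) Eigenvalues of Phi1; any eigenvalue outside the unit circle means nonstationarity.
eigenvalues = np.sort(np.linalg.eigvals(Phi1).real)

# (b) Moving average matrices Psi_s = Phi1^s for a first-order VAR.
Psi1 = Phi1
Psi2 = Phi1 @ Phi1

# (c) MSE matrix of the two-period-ahead forecast: Omega + Psi1 Omega Psi1'.
mse = Omega + Psi1 @ Omega @ Psi1.T
mse_y1 = mse[0, 0]
# Part of the y1 forecast error variance that comes from the first innovation.
from_eps1 = Omega[0, 0] * (1.0 + Psi1[0, 0] ** 2)
share_eps1 = from_eps1 / mse_y1

print(eigenvalues, Psi2, mse_y1, round(share_eps1, 2))
```

The eigenvalue 1.2 of $\boldsymbol{\Phi}_1$ is the reciprocal of the root $z^* = 1/1.2$, which is another way of seeing that the system is nonstationary.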


Chapter 12. Bayesian Analysis 


12.1. Take $k = 1$, $\mathbf{X} = \mathbf{1}$, $\beta = \mu$, and $M = 1/\nu$, and notice that $\mathbf{1}'\mathbf{1} = T$ and $\mathbf{1}'\mathbf{y} = T\bar{y}$.


Chapter 13. The Kalman Filter 


13.3. No, because $v_t$ is not white noise.


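13.5. Notice that
$$\tilde{\sigma}^2 + \tilde{\theta}^2\tilde{p}_t = \tilde{\sigma}^2\,\frac{1 + \tilde{\theta}^2 + \tilde{\theta}^4 + \cdots + \tilde{\theta}^{2t}}{1 + \tilde{\theta}^2 + \tilde{\theta}^4 + \cdots + \tilde{\theta}^{2(t-1)}}$$
$$= \tilde{\sigma}^2\,\frac{1 - \tilde{\theta}^{2(t+1)}}{1 - \tilde{\theta}^{2t}}$$
$$= \theta^2\sigma^2\,\frac{1 - \theta^{-2(t+1)}}{1 - \theta^{-2t}}$$
$$= \theta^2\sigma^2\,\frac{\theta^{2(t+1)} - 1}{\theta^{2(t+1)} - \theta^2}$$
$$= \sigma^2\,\frac{1 - \theta^{2(t+1)}}{1 - \theta^{2t}}$$
$$= \sigma^2 + \theta^2p_t.$$
Furthermore, from [13.3.19],
$$\tilde{\theta}\tilde{\varepsilon}_{t|t} = \{\tilde{\theta}\tilde{\sigma}^2/[\tilde{\sigma}^2 + \tilde{\theta}^2\tilde{p}_t]\}\cdot\{y_t - \mu - \tilde{\theta}\tilde{\varepsilon}_{t-1|t-1}\}$$
$$= \{\theta^{-1}\theta^2\sigma^2/[\sigma^2 + \theta^2p_t]\}\cdot\{y_t - \mu - \tilde{\theta}\tilde{\varepsilon}_{t-1|t-1}\}$$
$$= \{\theta\sigma^2/[\sigma^2 + \theta^2p_t]\}\cdot\{y_t - \mu - \tilde{\theta}\tilde{\varepsilon}_{t-1|t-1}\},$$
which is the same difference equation that generates $\{\theta\hat{\varepsilon}_{t|t}\}$, with both sequences, of course, beginning with $\theta\hat{\varepsilon}_{0|0} = \tilde{\theta}\tilde{\varepsilon}_{0|0} = 0$. With the sequences $(\mathbf{H}'\mathbf{P}_{t+1|t}\mathbf{H} + \mathbf{R})$ and $\mathbf{A}'\mathbf{x}_{t+1} + \mathbf{H}'\hat{\boldsymbol{\xi}}_{t+1|t}$ identical for the two representations, the likelihood in [13.4.1] to [13.4.3] must be identical.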


13.6. The innovation $\varepsilon_t$ in [13.5.22] will be fundamental when $|\phi - K| < 1$. From [13.5.25], we see that
$$\phi - K = \phi\sigma_W^2/(\sigma_W^2 + P).$$
Since $P$ is a variance, it follows that $P \geq 0$, and so $|\phi - K| \leq |\phi|$, which is specified to be less than unity. This arises as a consequence of the general result in Proposition 13.2 that the eigenvalues of $\mathbf{F} - \mathbf{K}\mathbf{H}'$ lie inside the unit circle.
From [13.5.23] and the preceding expression for $\phi - K$,
$$-(\phi - K)E(\varepsilon_t^2) = -(\phi - K)(\sigma_W^2 + P) = -\phi\sigma_W^2,$$
as claimed. Furthermore,
$$[1 + (\phi - K)^2]E(\varepsilon_t^2) = (\sigma_W^2 + P) + (\phi - K)\phi\sigma_W^2$$
$$= (1 + \phi^2)\sigma_W^2 + P - K\phi\sigma_W^2.$$
But from [13.5.24] and [13.5.25],
$$P = K\phi\sigma_W^2 + \sigma_V^2,$$
and so
$$[1 + (\phi - K)^2]E(\varepsilon_t^2) = (1 + \phi^2)\sigma_W^2 + \sigma_V^2.$$
To understand these formulas from the perspective of the formulas in Chapter 4, note that the model adds an AR(1) process to white noise, producing an ARMA(1, 1) process:
$$(1 - \phi L)y_{t+1} = v_{t+1} + (1 - \phi L)w_{t+1}.$$
The first autocovariance of the MA(1) process on the right side of this expression is $-\phi\sigma_W^2$, while the variance is $(1 + \phi^2)\sigma_W^2 + \sigma_V^2$.
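The two moment identities in answer 13.6 can be verified numerically by iterating the steady-state recursion for $P$. The sketch below assumes the recursion $P = \phi^2 P\sigma_W^2/(\sigma_W^2 + P) + \sigma_V^2$ and the gain $K = \phi P/(\sigma_W^2 + P)$, which are what the expressions quoted from [13.5.24] and [13.5.25] imply; the parameter values and the function name `steady_state` are hypothetical.

```python
def steady_state(phi, var_w, var_v, iterations=500):
    """Iterate the scalar Riccati recursion to its fixed point and return (P, K)."""
    P = var_v  # any nonnegative starting value works; the map is a contraction for |phi| < 1
    for _ in range(iterations):
        P = phi ** 2 * P * var_w / (var_w + P) + var_v
    K = phi * P / (var_w + P)
    return P, K

phi, var_w, var_v = 0.8, 1.5, 0.7  # hypothetical parameter values
P, K = steady_state(phi, var_w, var_v)
var_eps = var_w + P  # E(eps_t^2) = sigma_W^2 + P

# First autocovariance and variance implied by the fundamental MA(1) representation.
first_autocov = -(phi - K) * var_eps
variance = (1 + (phi - K) ** 2) * var_eps

print(first_autocov, -phi * var_w)
print(variance, (1 + phi ** 2) * var_w + var_v)
```

Each printed pair should agree up to rounding error, and `abs(phi - K)` should be below one, consistent with the innovation being fundamental.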


Chapter 16. Processes with Deterministic Time Trends 
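16.1. $$E\Big\{(1/T)\sum_{t=1}^{T}[\lambda_1 + \lambda_2(t/T)]^2\varepsilon_t^2 - (1/T)\sum_{t=1}^{T}\sigma^2[\lambda_1^2 + 2\lambda_1\lambda_2(t/T) + \lambda_2^2(t/T)^2]\Big\}^2$$
$$= (1/T^2)\sum_{t=1}^{T}[\lambda_1^2 + 2\lambda_1\lambda_2(t/T) + \lambda_2^2(t/T)^2]^2\cdot E(\varepsilon_t^2 - \sigma^2)^2.$$
But
$$(1/T)\sum_{t=1}^{T}[\lambda_1^2 + 2\lambda_1\lambda_2(t/T) + \lambda_2^2(t/T)^2]^2 \to M < \infty,$$
and thus
$$T\cdot E\Big\{(1/T)\sum_{t=1}^{T}[\lambda_1 + \lambda_2(t/T)]^2\varepsilon_t^2 - (1/T)\sum_{t=1}^{T}\sigma^2[\lambda_1^2 + 2\lambda_1\lambda_2(t/T) + \lambda_2^2(t/T)^2]\Big\}^2 \to M\cdot E(\varepsilon_t^2 - \sigma^2)^2 < \infty.$$
Thus
$$(1/T)\sum_{t=1}^{T}[\lambda_1 + \lambda_2(t/T)]^2\varepsilon_t^2 \xrightarrow{m.s.} (1/T)\sum_{t=1}^{T}\sigma^2[\lambda_1^2 + 2\lambda_1\lambda_2(t/T) + \lambda_2^2(t/T)^2] \to \sigma^2\boldsymbol{\lambda}'\mathbf{Q}\boldsymbol{\lambda}.$$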




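16.2. Recall that the variance of $\mathbf{b}_T$ is given by
$$E(\mathbf{b}_T - \boldsymbol{\beta})(\mathbf{b}_T - \boldsymbol{\beta})' = \sigma^2\Big[\sum_{t=1}^{T}\mathbf{x}_t\mathbf{x}_t'\Big]^{-1} = \sigma^2\begin{bmatrix} T & T(T+1)/2 \\ T(T+1)/2 & T(T+1)(2T+1)/6 \end{bmatrix}^{-1}.$$
Pre- and postmultiplying by $\boldsymbol{\Upsilon}_T$ results in
$$E[\boldsymbol{\Upsilon}_T(\mathbf{b}_T - \boldsymbol{\beta})(\mathbf{b}_T - \boldsymbol{\beta})'\boldsymbol{\Upsilon}_T] = \sigma^2\boldsymbol{\Upsilon}_T\begin{bmatrix} T & T(T+1)/2 \\ T(T+1)/2 & T(T+1)(2T+1)/6 \end{bmatrix}^{-1}\boldsymbol{\Upsilon}_T$$
$$= \sigma^2\left\{\boldsymbol{\Upsilon}_T^{-1}\begin{bmatrix} T & T(T+1)/2 \\ T(T+1)/2 & T(T+1)(2T+1)/6 \end{bmatrix}\boldsymbol{\Upsilon}_T^{-1}\right\}^{-1}$$
$$\to \sigma^2\begin{bmatrix} 1 & 1/2 \\ 1/2 & 1/3 \end{bmatrix}^{-1}.$$
The (2, 2) element of this matrix expression holds that
$$E[T^{3/2}(\hat{\delta}_T - \delta)]^2 \to 12\sigma^2,$$
and so
$$T(\hat{\delta}_T - \delta) \xrightarrow{m.s.} 0.$$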
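The limit in answer 16.2 can be illustrated numerically. The sketch below assumes a regression on a constant and a time trend, so that $\mathbf{x}_t = (1, t)'$, and the scaling matrix $\boldsymbol{\Upsilon}_T = \mathrm{diag}(T^{1/2}, T^{3/2})$; the function name `scaled_variance` and the sample size are illustrative choices.

```python
import numpy as np

def scaled_variance(T, sigma2=1.0):
    """Return sigma^2 * {Upsilon_T^{-1} (sum of x_t x_t') Upsilon_T^{-1}}^{-1} for x_t = (1, t)'."""
    t = np.arange(1, T + 1, dtype=float)
    X = np.column_stack([np.ones(T), t])
    XtX = X.T @ X
    upsilon_inv = np.diag([T ** -0.5, T ** -1.5])
    # Rescale first, then invert, which keeps the matrix well conditioned.
    return sigma2 * np.linalg.inv(upsilon_inv @ XtX @ upsilon_inv)

limit = np.linalg.inv(np.array([[1.0, 0.5],
                                [0.5, 1.0 / 3.0]]))
approx = scaled_variance(100_000)
print(limit)   # the limiting matrix
print(approx)  # finite-sample counterpart
```

The (2, 2) element of the limiting matrix is 12, matching the statement that the scaled variance of the trend coefficient converges to $12\sigma^2$.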
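16.3. Notice that
$$\Big[T^{-1}\sum_{t=1}^{T}(t/T)y_t\Big]^2 = T^{-2}[(1/T)y_1 + (2/T)y_2 + \cdots + (T/T)y_T] \times [(1/T)y_1 + (2/T)y_2 + \cdots + (T/T)y_T],$$
which has expectation
$$E\Big[T^{-1}\sum_{t=1}^{T}(t/T)y_t\Big]^2$$
$$= T^{-2}\{[(1/T)^2 + (2/T)^2 + \cdots + (T/T)^2]\gamma_0$$
$$+ [(1/T)(2/T) + (2/T)(3/T) + \cdots + [(T-1)/T](T/T)]2\gamma_1$$
$$+ [(1/T)(3/T) + (2/T)(4/T) + \cdots + [(T-2)/T](T/T)]2\gamma_2$$
$$+ \cdots + [(1/T)(T/T)]2\gamma_{T-1}\}$$
$$\leq T^{-1}\{|\gamma_0| + 2|\gamma_1| + 2|\gamma_2| + \cdots + 2|\gamma_{T-1}|\}$$
$$\to 0.$$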


Chapter 17. Univariate Processes with Unit Roots 
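17.2. (a) $$T(\hat{\rho}_T - 1) = \frac{T^{-1}\sum y_{t-1}u_t}{T^{-2}\sum y_{t-1}^2} \xrightarrow{L} \frac{\tfrac{1}{2}\{\lambda^2[W(1)]^2 - \gamma_0\}}{\lambda^2\int_0^1[W(r)]^2\,dr}$$
from Proposition 17.3(e) and (h).
(b) $$T^2\cdot\hat{\sigma}^2_{\hat{\rho}_T} = T^2\cdot s_T^2 \div \Big(\sum y_{t-1}^2\Big) = s_T^2 \div \Big(T^{-2}\sum y_{t-1}^2\Big) \xrightarrow{L} \gamma_0 \div \Big(\lambda^2\int_0^1[W(r)]^2\,dr\Big),$$
from Proposition 17.3(h) and [17.6.10].
(c) $$t_T = T(\hat{\rho}_T - 1) \div (T^2\cdot\hat{\sigma}^2_{\hat{\rho}_T})^{1/2} \xrightarrow{L} \frac{\tfrac{1}{2}\{\lambda^2[W(1)]^2 - \gamma_0\}}{\lambda^2\int_0^1[W(r)]^2\,dr} \times \Big(\lambda^2\int_0^1[W(r)]^2\,dr\Big)^{1/2} \div (\gamma_0)^{1/2},$$
from answers (a) and (b). This, in turn, can be written
$$\frac{\tfrac{1}{2}\{\lambda^2[W(1)]^2 - \gamma_0\}}{\{\lambda^2\gamma_0\int_0^1[W(r)]^2\,dr\}^{1/2}} = (\lambda^2/\gamma_0)^{1/2}\left\{\frac{\tfrac{1}{2}\{[W(1)]^2 - 1\}}{\{\int_0^1[W(r)]^2\,dr\}^{1/2}} + \frac{\tfrac{1}{2}(\lambda^2 - \gamma_0)}{\lambda^2\{\int_0^1[W(r)]^2\,dr\}^{1/2}}\right\}.$$
(d) $$(T^2\cdot\hat{\sigma}^2_{\hat{\rho}_T} \div s_T^2) = 1\Big/\Big(T^{-2}\sum y_{t-1}^2\Big) \xrightarrow{L} 1\Big/\Big\{\lambda^2\int_0^1[W(r)]^2\,dr\Big\},$$
from Proposition 17.3(h). Thus,
$$T(\hat{\rho}_T - 1) - \tfrac{1}{2}(T^2\cdot\hat{\sigma}^2_{\hat{\rho}_T} \div s_T^2)(\lambda^2 - \gamma_0)$$
$$\xrightarrow{L} \frac{\tfrac{1}{2}\{\lambda^2[W(1)]^2 - \gamma_0\}}{\lambda^2\int_0^1[W(r)]^2\,dr} - \frac{\tfrac{1}{2}(\lambda^2 - \gamma_0)}{\lambda^2\int_0^1[W(r)]^2\,dr}$$
$$= \frac{\tfrac{1}{2}\{[W(1)]^2 - 1\}}{\int_0^1[W(r)]^2\,dr},$$
with the next-to-last line following from answer (a).
(e) $$(\gamma_0/\lambda^2)^{1/2}\cdot t_T - \{\tfrac{1}{2}(\lambda^2 - \gamma_0)/\lambda\} \times \{T\cdot\hat{\sigma}_{\hat{\rho}_T} \div s_T\}$$
$$\xrightarrow{L} \left\{\frac{\tfrac{1}{2}\{[W(1)]^2 - 1\}}{\{\int_0^1[W(r)]^2\,dr\}^{1/2}} + \frac{\tfrac{1}{2}(\lambda^2 - \gamma_0)}{\lambda^2\{\int_0^1[W(r)]^2\,dr\}^{1/2}}\right\} - \{\tfrac{1}{2}(\lambda^2 - \gamma_0)/\lambda\} \div \Big(\lambda^2\int_0^1[W(r)]^2\,dr\Big)^{1/2},$$
from answers (c) and (b). Adding these terms produces the desired result.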
To estimate $\gamma_0$ and $\lambda^2$, one could use
$$\hat{\gamma}_j = T^{-1}\sum_{t=j+1}^{T}\hat{u}_t\hat{u}_{t-j} \quad \text{for } j = 0, 1, \ldots, q$$
$$\hat{\lambda}^2 = \hat{\gamma}_0 + 2\sum_{j=1}^{q}[1 - j/(q+1)]\hat{\gamma}_j,$$
where $\hat{u}_t$ is the OLS sample residual and $q$ is the number of autocovariances used to represent the serial correlation of $\psi(L)\varepsilon_t$. The statistic in (d) can then be compared with the case 1 entries of Table B.5, while the statistic in (e) can be compared with the case 1 entries of Table B.6.
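The two estimators just described, the sample autocovariances of the residuals and the Bartlett-weighted sum that estimates $\lambda^2$, can be sketched as follows. The function name `long_run_variance` and the toy residual series are illustrative assumptions; in practice `u` would hold the OLS residuals from the regression.

```python
import numpy as np

def long_run_variance(u, q):
    """Return (gamma_0, lambda^2), with gamma_j = T^{-1} * sum_{t=j+1}^{T} u_t * u_{t-j}."""
    u = np.asarray(u, dtype=float)
    T = u.size
    # Sample autocovariances for j = 0, 1, ..., q (divisor T, no demeaning, as in the text).
    gamma = [np.dot(u[j:], u[:T - j]) / T for j in range(q + 1)]
    # Bartlett weights 1 - j/(q+1) applied to autocovariances 1 through q.
    lam2 = gamma[0] + 2.0 * sum((1.0 - j / (q + 1)) * gamma[j] for j in range(1, q + 1))
    return gamma[0], lam2

# Toy residual series, purely for illustration.
gamma0, lam2 = long_run_variance([1.0, -1.0, 1.0, -1.0], q=1)
print(gamma0, lam2)
```

With `q = 0` the weighted sum is empty and the estimate of $\lambda^2$ collapses to $\hat{\gamma}_0$, which is the case of serially uncorrelated residuals.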


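17.3. (a) $$\begin{bmatrix} 1 & T^{-3/2}\sum\xi_{t-1} & T^{-2}\sum t \\ T^{-3/2}\sum\xi_{t-1} & T^{-2}\sum\xi_{t-1}^2 & T^{-5/2}\sum t\xi_{t-1} \\ T^{-2}\sum t & T^{-5/2}\sum t\xi_{t-1} & T^{-3}\sum t^2 \end{bmatrix} \xrightarrow{L} \begin{bmatrix} 1 & \lambda\int W(r)\,dr & 1/2 \\ \lambda\int W(r)\,dr & \lambda^2\int[W(r)]^2\,dr & \lambda\int rW(r)\,dr \\ 1/2 & \lambda\int rW(r)\,dr & 1/3 \end{bmatrix}$$

(b) $$\begin{bmatrix} T^{-1/2}\sum u_t \\ T^{-1}\sum\xi_{t-1}u_t \\ T^{-3/2}\sum tu_t \end{bmatrix} \xrightarrow{L} \begin{bmatrix} \lambda\cdot W(1) \\ \tfrac{1}{2}\{\lambda^2[W(1)]^2 - \gamma_0\} \\ \lambda\{W(1) - \int W(r)\,dr\} \end{bmatrix}$$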


(c) This follows from expression [17.4.52] and answers (a) and (b). 
(d) The calculations are virtually identical to those in [17.4.54]. 

(€) ty = Tbr — 1) + {1-65} > T(r — 1) + {HX OY". 
(f) Answer (c) establishes that 


Tr = 1) 
1 | Wr) dr 1/2 oe W(1) 
440 1 9 | W(r) dr i [W(n)F dr | rW(r) dr x(W)P - 1} 
v2 | rW(r) dr 1/3 W(1) - { Wr) dr 

1 | W(r) dr nm | 

0 

+ H{1 — (yo/A2)H0 1 0} [ ww dr | wor dr { rW(r) dr ; 

0 


1/2 { rW(r) dr 1/3 
= V+ HL — (yo/A)}0. 
Moreover, answer (d) implies that 


i(T?+8}, + 53)-(? - 0) > MOIA2)(A? — y) 
= {1 - (p/A2)}0. 
(g) From answers (d) and (e), 
(YolA?)'? +t — {2(A? — yo)A} x {T-Oy, + sr} 
* Tbr — WIVE — {402 — y)/A} x VOIA 
= {T(6, - 1) — 1(Q/A) 7 — »)} + VO 
SV+ VO,~7 
from the analysis of (f). 
To estimate $\gamma_0$ and $\lambda^2$, one could use
$$\hat{\gamma}_j = T^{-1}\sum_{t=j+1}^{T}\hat{u}_t\hat{u}_{t-j} \quad \text{for } j = 0, 1, \ldots, q$$
$$\hat{\lambda}^2 = \hat{\gamma}_0 + 2\sum_{j=1}^{q}[1 - j/(q+1)]\hat{\gamma}_j,$$
where $\hat{u}_t$ is the OLS sample residual and $q$ is the number of autocovariances used to approximate the dynamics of $\psi(L)\varepsilon_t$. The statistic in (f) can then be compared with the case 4 entries of Table B.5, while the statistic in (g) can be compared with the case 4 entries of Table B.6.


17.4. (b) Case 1 of Table B.5 is appropriate asymptotically. (c) Case 1 of Table B.6 is 
appropriate asymptotically. 


Chapter 18. Unit Roots in Multivariate Time Series 
18.1. Under the null hypothesis RB = r, we have 


x4 = [Roby — B)I'[s3REx,x)'R’] [R(; - B)] 
= [VT-R(by — B)I'[s¢VT-RExx!)-'VT-R'| '[VT-R(br ~ BDL 




For $\boldsymbol{\Upsilon}_T$ the $(k \times k)$ matrix defined in [18.2.18] and $\mathbf{R}$ of the specified form, observe that $\sqrt{T}\cdot\mathbf{R} = \mathbf{R}\boldsymbol{\Upsilon}_T$. Thus,


x4 = [RY;(br — B)]'[5}RYCx,x/)"'Y,R'] [RY;(b; — B)] 
= [RY+(br — B))[5¢R(VF'Sx,x/¥7")1R’][RY7(b, - B)] 


U -1 
L V-'h v-' 0 Ta, Voth 
(mle ]) (mls ole) (#Loom) 
= (R,V~*h,)'(9,,R:V~*R;)~'(R,V~"h), 


where the indicated convergence follows from [18.2.25], [18.2.20], and consistency of $s_T^2$.
Since h, ~ N(0, a;;V), it follows that 


R,V-'h, ~ N(0, ¢,R,V~'R}). 
Hence, from Proposition 8.1, the asymptotic distribution of $\chi_T^2$ is $\chi^2(m)$.
18.2. Here 


-1 
x4 = (Rbz)'[s}R(Sx,x;)"'R’] (Rb), 


where x, is as defined in Exercise 18.1 and 


(I,-, @ R,) 0 
(me Vxnio= 1) ta(p=1)x(a41)) 


(npp Xk) 2 
(nz x nlp - VD} fag x(n+1)) 


= 0 I, 
R, = laos ky| 


(nz Xn) 


-/|o R 
bah Ee (m xn) |° 


From the results of Exercise 18.1, 


24 (ppv m]) v0 DV fv 
a Chea) CoM o-®) (5-2) 
oA bigger eeceel Wic 0 if 
R,Q-"h, a 0 R,Q-'R; 
x he ® | 
R,Q“"h, ' 


18.3. (a) The null hypothesis is that @ = 1 and y = a = y = 0, in which case Ay,, = 
ey and u, = Ey. Let x, = (Ex, 1, Yiye-1 You-1)' and 


T2 0 00 
_|0 T® 00 
Yr=1 gq 0 T of 
0 0 OT 
Then 
Ys'Sx,x/¥z! 
T~'Xe2, T'Xe,, To Zeit To 3726 2Yor~1 
a T'Se, 1 T- Ey -1 T~ #3 yo 0-1 


T7872 y\ Ex TZ iy T-*Zy2,-1 To 72. Yoe-2 
T7304 18 TOE y.y-¢ TOE yor Yay-i T~72y3,-1 


4 | 92 0 
0 Q|’ 




where 


1 of W,(r) dr o,- [Wl dr 
aslo [mma otf I@Pd — oy | POLIO] & 


ox | Wal) dr oxe.-[ (A) IWA] dro | [WAP ar 


and . 
T- VIE EE 1, p 
T- Ze, zu |hy 
Y>;'Sx,u, = > 
ee TO 2 n e180 h, |’ 
To Dy 20-18 te 


and where h, ~ N(0, 303) and the second and third elements of the (3 x 1) vector h, 
have a nonstandard distribution. Hence, 


Yr(br - B) = pi re tated eS ieee 
s[3 a [kl 
[a 
.~ 1Q-th, |’ 
(b) Let e, denote the first column of the (4 x 4) identity matrix. Then 
tr = Hy + (sh ei(Sxxi)- te} 
= Ty, + {sper¥(Sxx7)"'Y rey} 
= T'4, + {sper(V7!x,x/¥7")-'e}” 
orm + fmf )'s)" 
= h(o,02) ~ N(0, 1). 


(c) Recall that 3; = fr — V7, where fis O,(T~') and ¥,is O,(T~ '”). Under the 
null hypothesis, all three values are zero; hence, 


78,5 -T¥,, 


which is asymptotically Gaussian. The t test of 6 = 0 is asymptotically equivalent to the t 
test of y = 0. 


Chapter 19. Cointegration 
19.1, (a) The OLS estimates are given by 


[*] a [ T Sela yu 
Vr Zyy, Dy}, Lyi} 
Fer Grae Para 
Vr — Yo Sy, Ey}, Lyte “Sy, 
a T el Lyne — YoYu) 
Lyx Dyk) L2ye(¥un — Yo¥an) 




from which 


T-2 9 & ]_[T-7 0 Le i T-2 QO 7" 
0 T’* \L4¥r — Yo 0 TY? || Syn Dyd, 0 tas 
% es 0 lI ZY — YoYa) 
-52 
0 T Dau — YoYo) 
T-¥ | T SyxJ[T? 0 J] 
O T-*]) Sy, Ty3}, 0 T- 2 
2 ies 0 lI Eu — oye) | 
-52 
0 T Zour — YoY) 


-1 
7 1 aor | T-?7E(Yyu — Yo¥n) 
T*3y2, T~*Xy}, TT? 3yo(¥i — YoYo) 


But 
Lyx = Tyro + 6: St + Zé, , 
asics oS aoc 
OfT) oT) — Orr?) 


and thus T-23y,,-> T~78,:3t — 6/2. Similarly, T-33y3,—> T-383:2@ — 82/3. Further- 
more, 


ZVie — Yo¥n) = TYr.0 — Yo¥20) + UE — Yok2)> 
ss a 
O,(T) OAT*) 


” 


establishing that T->S(y,, — yoyo,) —> T~>?5(Es — Yof>,). Similarly, 
Lal(Yre — Yo¥a) = V(Y20 + Sat + )(Yr.0 + be — YoY20 — Yok2) 


and T-5?Zy.(yie — YoYar) + T~*?8t (be — Yok.) 


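(b) $$\Delta\hat{u}_t = (y_{1t} - \hat{\alpha}_T - \hat{\gamma}_Ty_{2t}) - (y_{1,t-1} - \hat{\alpha}_T - \hat{\gamma}_Ty_{2,t-1})$$
$$= \Delta y_{1t} - \hat{\gamma}_T\Delta y_{2t}$$
$$\xrightarrow{p} \Delta y_{1t} - \gamma_0\Delta y_{2t}$$
since $\hat{\gamma}_T \xrightarrow{p} \gamma_0$.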


19.2. Proposition 18.1 is used to show that 


TB; — B) T-'Sww, T-'Sw, T-3wyz] '[T-SwZ, 
T' (a, —a)} =| T-'Sw 1 T-*3y;, T- "Zz, 
Tr - ¥) T-*3y,w, T-3*Xy, T-*2ynyo T-3y2,2, 

Q 0 0 in 

4} 0 1 {f [W.(r)]’ arhis 
0 Ay | W,(r) dr haf f [W.(r)}-[W2()] arhis 
h, 
x Au W,(1) : 


hn{ f W200) ano} in 


as claimed. 




19.3. Notice as in [19.3.13] that under the null hypothesis, 
x7 = {RYT Ar — vy {s40 0 RJ 


T-'Sw,wl  T-'Sw, To? 2w,y3,)7'T 07, -1 
x | T-'Sw! 1 T-*3y;, | | 0’ } {Ry Tr — vy} 


T- 3S y2,w, TO **2y, T~72yxYud [R, 


%. 


a root 6 R,] 
Q 0 0 vy 
on- 
x 0’ 1 {fiver aba 0’ (R, Av], 
0 Auf wand aff wontwaor arhic 


from which (19.3.25] follows immediately. 


19.4. 
“1 
Th; — B) T"'Sw,w! =T-'Sw, T-=w,yi, T-22w,t 
T'%(a,— a)|_ T-'Sw! 1 T-*3yi, Tt 
T(¥r - ¥) T-#23y2,w, T-3*3y T*2yuyn T-*Zyut 
T?2(6, — 6) T-2Stw/ T-t = T-Sty,, = T3522 
T-22w,u, 
T- "Su, 
T~ "Zyott, 
T-323tu, 
0 0 0 = 
0’ 1 { | [W.(r)]’ arkit 1/2 
s| 7° i 
0 Ay i W.(r) dr haf | [W.27)]}-[W.0)]’ aris An { rW,(r) dr 
0’ 12 { | r[W.(r)]’ ari 1/3 
h, 
4nW,(1) 
x - 
half ewan abi 
inf ) m= [ mo ar} 
as claimed. 


Chapter 20. Full-Information Maximum Likelihood Analysis 
of Cointegrated Systems 


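20.1. Form the Lagrangean
$$\mathbf{k}_1'\boldsymbol{\Sigma}_{YX}\mathbf{a}_1 + \mu_k(1 - \mathbf{k}_1'\boldsymbol{\Sigma}_{YY}\mathbf{k}_1) + \mu_a(1 - \mathbf{a}_1'\boldsymbol{\Sigma}_{XX}\mathbf{a}_1),$$
with $\mu_k$ and $\mu_a$ Lagrange multipliers. First-order conditions are
(a) $\boldsymbol{\Sigma}_{YX}\mathbf{a}_1 = 2\mu_k\boldsymbol{\Sigma}_{YY}\mathbf{k}_1$
(b) $\boldsymbol{\Sigma}_{XY}\mathbf{k}_1 = 2\mu_a\boldsymbol{\Sigma}_{XX}\mathbf{a}_1$.
Premultiply (a) by $\mathbf{k}_1'$ and (b) by $\mathbf{a}_1'$ to deduce that
$$2\mu_k = 2\mu_a \equiv r_1.$$
Next, premultiply (a) by $r_1^{-1}\boldsymbol{\Sigma}_{YY}^{-1}$ and substitute the result into (b):
$$\boldsymbol{\Sigma}_{XY}\boldsymbol{\Sigma}_{YY}^{-1}\boldsymbol{\Sigma}_{YX}\mathbf{a}_1 = r_1^2\boldsymbol{\Sigma}_{XX}\mathbf{a}_1$$
or
$$\boldsymbol{\Sigma}_{XX}^{-1}\boldsymbol{\Sigma}_{XY}\boldsymbol{\Sigma}_{YY}^{-1}\boldsymbol{\Sigma}_{YX}\mathbf{a}_1 = r_1^2\mathbf{a}_1.$$
Thus, $r_1^2$ is an eigenvalue of $\boldsymbol{\Sigma}_{XX}^{-1}\boldsymbol{\Sigma}_{XY}\boldsymbol{\Sigma}_{YY}^{-1}\boldsymbol{\Sigma}_{YX}$ with $\mathbf{a}_1$ the associated eigenvector, as claimed.

Similarly, premultiplying (b) by $r_1^{-1}\boldsymbol{\Sigma}_{XX}^{-1}$ and substituting the result into (a) reveals that
$$\boldsymbol{\Sigma}_{YY}^{-1}\boldsymbol{\Sigma}_{YX}\boldsymbol{\Sigma}_{XX}^{-1}\boldsymbol{\Sigma}_{XY}\mathbf{k}_1 = r_1^2\mathbf{k}_1.$$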
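The eigenvalue characterization in answer 20.1 can be checked on a randomly generated covariance matrix. This is a sketch only: the function name `first_canonical_pair`, the dimensions, and the random seed are assumptions chosen for illustration, and the joint covariance matrix is simulated rather than taken from any data.

```python
import numpy as np

def first_canonical_pair(S_YY, S_YX, S_XX):
    """Return (r1, k1, a1): the largest canonical correlation and its weight vectors."""
    S_XY = S_YX.T
    # Matrix whose eigenvalues are the squared canonical correlations.
    M = np.linalg.solve(S_XX, S_XY) @ np.linalg.solve(S_YY, S_YX)
    eigvals, eigvecs = np.linalg.eig(M)
    idx = np.argmax(eigvals.real)
    r1 = np.sqrt(eigvals.real[idx])
    a1 = eigvecs[:, idx].real
    a1 = a1 / np.sqrt(a1 @ S_XX @ a1)             # normalize so that a1' S_XX a1 = 1
    k1 = np.linalg.solve(S_YY, S_YX @ a1) / r1    # from first-order condition (a)
    return r1, k1, a1

rng = np.random.default_rng(0)
n1, n2 = 2, 3
A = rng.standard_normal((n1 + n2, n1 + n2))
S = A @ A.T + np.eye(n1 + n2)  # positive definite joint covariance of (Y, X)
S_YY, S_YX, S_XX = S[:n1, :n1], S[:n1, n1:], S[n1:, n1:]

r1, k1, a1 = first_canonical_pair(S_YY, S_YX, S_XX)
print(r1, k1 @ S_YX @ a1, k1 @ S_YY @ k1)
```

The objective value `k1 @ S_YX @ a1` should equal `r1`, and `k1` should satisfy the normalization $\mathbf{k}_1'\boldsymbol{\Sigma}_{YY}\mathbf{k}_1 = 1$ without having been rescaled explicitly.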
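20.2. The restriction when $h = 0$ is that $\boldsymbol{\zeta}_0 = \mathbf{0}$. In this case, [20.3.2] would be
$$\mathcal{L}_0^* = -(Tn/2)\log(2\pi) - (Tn/2) - (T/2)\log|\hat{\boldsymbol{\Sigma}}_{UU}|,$$
where $\hat{\boldsymbol{\Sigma}}_{UU}$ is the variance-covariance matrix for the residuals of [20.2.4]. This will be recognized from expression [11.1.32] as the maximum value attained for the log likelihood for the model
$$\Delta\mathbf{y}_t = \boldsymbol{\pi}_0 + \boldsymbol{\Pi}_1\Delta\mathbf{y}_{t-1} + \boldsymbol{\Pi}_2\Delta\mathbf{y}_{t-2} + \cdots + \boldsymbol{\Pi}_{p-1}\Delta\mathbf{y}_{t-p+1} + \mathbf{u}_t,$$
as claimed.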
20.3. The residuals &, are the same as the residuals from an unrestricted regression of 0, 
on ¥,. The MSE matrix for the latter regression is given by Syy — Syy2vVSvy. Thus, 
[Xcel a [Zou i Zuv2w2yul | . 
Zul ‘|L, — LodZw2wW2vol 
, Zool TT 4, 
where 6, denotes the ith eigenvalue of I, — SH2wEn dvu- Recalling that A, is an eigenvalue 
of ZodZuv2vi2vy associated with the eigenvector k,, we have that 


[. 7 Sad SitSe = (1- Adk, 

so that 6, = (1 — A,) is an eigenvalue of I, — LadlwEr2vy and 
[Sas = Zou! — A). 
Hence, the two expressions are equivalent. 
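20.4. Here, $\hat{\lambda}_1$ is the scalar
$$\hat{\lambda}_1 = \hat{\Sigma}_{UU}^{-1}\hat{\Sigma}_{UV}\hat{\Sigma}_{VV}^{-1}\hat{\Sigma}_{VU}$$
and the test statistic is
$$-T\log(1 - \hat{\lambda}_1) = -T\log[(\hat{\Sigma}_{UU})^{-1}\cdot(\hat{\Sigma}_{UU} - \hat{\Sigma}_{UV}\hat{\Sigma}_{VV}^{-1}\hat{\Sigma}_{VU})].$$
But $\hat{u}_t$ is the residual from a regression of $\Delta y_t$ on a constant and $\Delta y_{t-1}, \Delta y_{t-2}, \ldots, \Delta y_{t-p+1}$, meaning that $\hat{\Sigma}_{UU} = \hat{\sigma}_0^2$. Likewise, $\hat{v}_t$ is the residual from a regression of $y_{t-1}$ on $\Delta y_{t-1}, \Delta y_{t-2}, \ldots, \Delta y_{t-p+1}$. The residual from a regression of $\hat{u}_t$ on $\hat{v}_t$, whose average squared value is given by $(\hat{\Sigma}_{UU} - \hat{\Sigma}_{UV}\hat{\Sigma}_{VV}^{-1}\hat{\Sigma}_{VU})$, is the same as the residual from a regression of $\Delta y_t$ on a constant, $y_{t-1}$, and $\Delta y_{t-1}, \Delta y_{t-2}, \ldots, \Delta y_{t-p+1}$, whose average squared value is denoted $\hat{\sigma}_1^2$:
$$(\hat{\Sigma}_{UU} - \hat{\Sigma}_{UV}\hat{\Sigma}_{VV}^{-1}\hat{\Sigma}_{VU}) = \hat{\sigma}_1^2.$$
Hence, the test statistic is equivalent to $T[\log(\hat{\sigma}_0^2) - \log(\hat{\sigma}_1^2)]$, as claimed.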




Chapter 22. Modeling Time Series with Changes in Regime 
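22.4. $$\mathbf{P}\mathbf{T} = \begin{bmatrix} p_{11} & 1 - p_{22} \\ 1 - p_{11} & p_{22} \end{bmatrix}\begin{bmatrix} (1 - p_{22})/(2 - p_{11} - p_{22}) & -1 \\ (1 - p_{11})/(2 - p_{11} - p_{22}) & 1 \end{bmatrix}$$
$$= \begin{bmatrix} (1 - p_{22})\left(\dfrac{p_{11}}{2 - p_{11} - p_{22}} + \dfrac{1 - p_{11}}{2 - p_{11} - p_{22}}\right) & 1 - p_{11} - p_{22} \\ (1 - p_{11})\left(\dfrac{1 - p_{22}}{2 - p_{11} - p_{22}} + \dfrac{p_{22}}{2 - p_{11} - p_{22}}\right) & -1 + p_{11} + p_{22} \end{bmatrix}$$
$$= \begin{bmatrix} (1 - p_{22})/(2 - p_{11} - p_{22}) & -\lambda_2 \\ (1 - p_{11})/(2 - p_{11} - p_{22}) & \lambda_2 \end{bmatrix}$$
$$= \mathbf{T}\boldsymbol{\Lambda},$$
where $\lambda_2 = -1 + p_{11} + p_{22}$ and $\boldsymbol{\Lambda}$ is the diagonal matrix with elements 1 and $\lambda_2$.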
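The eigenvalue decomposition verified in answer 22.4 is easy to confirm numerically for a two-state Markov chain. In the sketch below the transition probabilities are hypothetical, and the columns of `P` are assumed to sum to one, matching the layout of the matrix in the answer.

```python
import numpy as np

def eigen_decomposition(p11, p22):
    """Return (P, T, Lam) for a two-state chain whose transition matrix has columns summing to one."""
    P = np.array([[p11, 1.0 - p22],
                  [1.0 - p11, p22]])
    d = 2.0 - p11 - p22
    # First column: ergodic probabilities; second column: eigenvector for lambda_2.
    T = np.array([[(1.0 - p22) / d, -1.0],
                  [(1.0 - p11) / d, 1.0]])
    Lam = np.diag([1.0, -1.0 + p11 + p22])
    return P, T, Lam

P, T, Lam = eigen_decomposition(0.9, 0.75)  # hypothetical transition probabilities
print(P @ T)
print(T @ Lam)
```

The first column of `T` holds the ergodic probabilities, so it sums to one and is left unchanged when premultiplied by `P`.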




D 


Greek Letters and 


Mathematical Symbols 
Used in the Text 


Greek letters and common interpretation 


alpha     α   population linear projection coefficient (page 74)
beta      β   population regression coefficient (page 200)
gamma     Γ   autocovariance matrix for vector process (page 261)
          γ   autocovariance for scalar process (page 45)
delta     Δ   change in value of variable (page 436)
          δ   small number; coefficient on time trend (page 435)
epsilon   ε   a white noise variable (page 47)
zeta      ζ   constant term in ARCH specification (page 658)
eta       η   AR(∞) coefficient (page 79)
theta     Θ   matrix of moving average coefficients (page 262)
          θ   vector of population parameters (page 747)
          θ   scalar MA(q) coefficient (page 50)
kappa     κ   kernel (page 165)
lambda    Λ   matrix of eigenvalues (page 730)
          λ   individual eigenvalue (page 729)
              Lagrange multiplier (page 135)
mu        μ   population mean (page 739)
nu        ν   degrees of freedom (page 409)
xi        Ξ   matrix of derivatives (page 339)
          ξ   state vector (pages 7 and 372)
pi        Π   product (page 747)
          π   the number 3.14159...
rho       ρ   autocorrelation (page 49)
              autoregressive coefficient (page 517)
sigma     Σ   summation
          Σ   long-run variance-covariance matrix (page 614)
          σ   population standard deviation (page 745)
tau       τ   time index
upsilon   Υ   scaling matrix to calculate asymptotic distributions (page 457)
phi       Φ   matrix of autoregressive coefficients (page 257)
          φ   scalar autoregressive coefficient (page 53)
chi       χ   a variable with a chi-square distribution (page 746)
psi       Ψ   matrix of moving average coefficients for vector MA(∞) process (page 262)
          ψ   moving average coefficient for scalar MA(∞) process (page 52)
omega     Ω   variance-covariance matrix (page 748)
          ω   frequency (page 153)




Common uses of other letters

a          number of elements of unknown parameter vector (page 135)
b or b_T   estimated OLS regression coefficient based on sample of size T (page 75)
c          vector of constant terms for vector autoregression (page 257)
c          constant term in univariate autoregression (page 53)
e_i        the ith column of the identity matrix
e          the base for the natural logarithms (page 715)
I_n        the (n × n) identity matrix (page 722)
i          the square root of negative one (page 708)
J          value of Lagrangean (page 135)
k          the number of explanatory variables in a regression
L          the lag operator (page 26)
ℒ          value of log likelihood function (page 747)
n          number of variables observed at date t in a vector system (page 257)
O_p(T)     order T in probability (page 460)
P          MSE matrix for inference about state vector (page 378)
p          the order of an autoregressive process (page 58)
Q          limiting value of (X'X/T) for X the (T × k) matrix of explanatory variables for an OLS regression (page 208); variance-covariance matrix of disturbances in state equation (page 373)
q          the order of a moving average process (page 50); number of autocovariances used in Newey-West estimate (page 281)
R          variance-covariance matrix of disturbances in observation equation (page 373)
ℝ^n        the set consisting of all real n-dimensional vectors (page 737)
r          number of variables in state equation (page 372); index of date for a continuous-time process
s_T^2      unbiased estimate of residual variance for an OLS regression with sample of size T (page 203)
s_t        state at date t for a Markov chain
T          the number of dates included in a sample
X          (T × k) matrix of explanatory variables for an OLS regression (page 201)
𝒴_t        history of observations through date t (page 143)
z          argument of autocovariance generating function (page 61)


Mathematical symbols 


ℵ                   aleph (first letter of the Hebrew alphabet), used for matrix of regression coefficients (page 636)
exp(x)              the number e (the base for natural logarithms) raised to the x power (page 715)
log(x)              natural logarithm of x (page 717)
x!                  x factorial (page 713)
[ ]_+               annihilation operator (page 78)
[x]                 greatest integer less than or equal to x
|x|                 absolute value of a real scalar or modulus of a complex scalar x (page 709)
|X|                 determinant of a square matrix X (page 724)
X'                  transpose of the matrix X (page 723)
0                   an (m × m) matrix of zeros
0'                  a (1 × n) row vector of zeros
∇                   gradient vector (page 735)
⊗                   Kronecker product (page 732)
⊙                   element-by-element multiplication (page 692)
y ≅ x               y is approximately equal to x
y ≡ x               y is defined to be the value represented by x
max{y, x}           the value given by the larger of y or x
y = sup f(r)        y is the smallest number such that y ≥ f(r) for all r in [0, 1] (page 481)
x ∈ A               x is an element of A
A ⊂ B               A is a subset of B (page 189)
P{A}                probability that event A occurs (page 739)
f_Y(y)              probability density of the random variable Y (page 739)
Y ~ N(μ, σ²)        Y has a N(μ, σ²) distribution (page 745)
Y ≈ N(μ, σ²)        Y has a distribution that is approximately N(μ, σ²) (page 210)
E(X)                expectation of X (page 740)
Var(X)              variance of X (page 740)
Cov(X, Y)           covariance between X and Y (page 742)
Corr(X, Y)          correlation between X and Y (page 743)
Y|X                 Y conditional on X (page 741)
P̂(Y|X)              linear projection of Y on X (pages 74-75)
Ê(Y|X)              linear projection of Y on X and a constant (pages 74-75)
ŷ_{t+1|t}           linear projection of y_{t+1} on a constant and a set of variables observed at date t (page 74)
x_T → y             lim x_T = y as T → ∞ (page 180)
x_T →p y            x_T converges in probability to y (pages 181, 749)
x_T →m.s. y         x_T converges in mean square to y (pages 182, 749)
x_T →L y            x_T converges in distribution to y (page 184)
x_T(·) →p x(·)      the sequence of functions whose value at r is x_T(r) converges in probability to the function whose value at r is x(r) (page 481)
x_T(·) →L x(·)      the sequence of functions whose value at r is x_T(r) converges in probability law to the function whose value at r is x(r) (page 481)


788 Appendix D | Symbols Used in the Text 


Author Index 


A 


Ahn, S. K., 601, 630 

Almon, Shirley, 360 

Amemiya, Takeshi, 227, 421 

Anderson, Brian D. O., 47, 373, 403 

Anderson, T. W., 152, 195 

Andrews, Donald W. K., 190, 191, 197, 284, 
285, 412, 425, 513, 533, 618, 698 

Angrist, Joshua D., 253 

Ashley, Richard, 109 


B 


Baillie, Richard T., 336, 573, 662, 666 

Barro, Robert J., 361 

Bates, Charles, 427, 664 

Berndt, E. K., 142, 661

Bera, A. K., 670, 672 

Bernanke, Ben, 330, 335 

Betancourt, Roger, 226 

Beveridge, Stephen, 504 

Bhargava, A., 532 

Billingsley, Patrick, 481 

Black, Fischer, 672 

Blanchard, Olivier, 330, 335 

Bloomfield, Peter, 152 

Blough, Stephen R., 445, 562 

Bollerslev, Tim, 658, 661, 662, 663, 665, 666, 
667, 668, 670, 671, 672 

Bouissou, M. B., 305 

Box, George E. P., 72, 109-10, 111, 126, 132, 
133 


Brock, W. A., 479 
Broyden, C. G., 142 
Burmeister, Edwin, 386, 388 
Butler, J. S., 664 


C


Cai, Jun, 662 

Caines, Peter E., 387, 388 
Campbell, John Y., 444, 516, 573 
Cao, Charles Q., 666 

Cecchetti, Stephen G., 532 
Chan, N. H., 483, 532 

Chiang, Alpha C., 135, 704 
Chiang, Chin Long, 19n, 23n 
Cho, Dongchul, 672 

Choi, B., 532, 601 

Chou, Ray Y., 658, 665, 668 
Christiano, Lawrence J., 305, 445, 447, 653 
Clarida, Richard, 573 


Clark, Peter K., 444 

Cochrane, John H., 444, 445, 531 
Cooley, Thomas F., 335 

Corbae, Dean, 573 

Cox, D. R., 126, 681, 685 

Cramér, Harald, 157, 184, 411, 427 


D 


Davidon, W. C., 139-42 

Davidson, James E. H., 571, 572, 581 

Davies, R. B., 698 

Day, Theodore E., 672 

DeGroot, Morris H., 355, 362 

DeJong, David N., 533 

Dempster, A. P., 387, 689 

Dent, Warren, 305 

Diamond, Peter, 335 

Dickey, David A., 475, 483, 493, 506, 516, 
530, 532 

Diebold, Francis X., 361, 449, 671, 690, 691 

Doan, Thomas A., 362, 402-3 

Durbin, James, 226 

Durland, J. Michael, 691 

Durlauf, Steven N., 444, 547, 587 


E 


Edison, Hali J., 672 

Eichenbaum, Martin, 445 

Eicker, F., 218 

Engle, Robert F., 227, 387, 571, 575, 596, 661,
664, 667, 668, 670, 671, 672, 698 

Evans, G. B. A., 216, 488, 516 

Everitt, B. S., 689 


F 


Fair, Ray C., 412, 425, 426 

Fama, Eugene F., 306, 360, 376 

Feige, Edgar L., 305 

Ferguson, T. S., 411

Ferson, Wayne E., 664 

Filardo, Andrew J., 690 

Fisher, Franklin M., 246, 247 

Flavin, Marjorie A., 216, 335 

Fletcher, R., 139-42 

Friedman, Milton, 440 

Fuller, Wayne A., 152, 153, 164, 275, 454, 
464, 475, 476, 483, 488, 493, 506, 516 

G 


Gagnon, Joseph E., 444
Galbraith, J. I., 124, 133 




Galbraith, R. F., 124, 133 

Gallant, A. Ronald, 283, 412, 421, 427, 431, 
672 

Garber, Peter M., 426 

Gevers, M., 388 

Geweke, John, 305, 313-14, 365, 449, 670 

Ghosh, Damayanti, 389 

Ghysels, Eric, 426 

Giannini, Carlo, 334, 336, 650 

Gibbons, Michael R., 376 

Glosten, Lawrence R., 663, 669, 672 

Goldfeld, Stephen M., 1, 3 

Gonzalez-Rivera, Gloria, 672 

Goodwin, Thomas H., 698 

Gourieroux, C., 431, 670 

Granger, C. W. J., 126, 291, 302-9, 448, 449, 
557, 571, 581-82 

Gregory, Allan W., 214 

Griffiths, William E., 142, 148, 227 


H


Hall, Alastair, 426, 427, 530, 532 

Hall, B. H., 142, 661 

Hall, P., 481, 482 

Hall, R. E., 142, 361, 661 

Hamilton, James D., 307, 388, 397, 444, 662, 
689, 695, 698 

Hammersley, J. M., 365 

Hand, D. J., 689 

Handscomb, D. C., 365 

Hannan, E. J., 47, 133, 388

Hansen, Bruce E., 589, 596, 601, 613-18, 651, 
698 

Hansen, Lars P., 67, 218, 280, 335, 409, 411, 
412, 414, 415, 419, 421, 422, 424 

Harvey, A. C., 226, 386 

Haug, Alfred A., 596 

Haugh, Larry D., 305 

Hausman, J. A., 142, 247, 661 

Hendry, David F., 571, 572, 581 

Heyde, C. C., 481, 482 

Higgins, M. L., 670, 672 

Hill, R. Carter, 142, 148, 227 

Hoel, Paul G., 704 

Hoerl, A. E., 355 

Hong, Y. S., 671, 672 

Hood, William C., 638 

Hosking, J. R. M., 448 

Hsieh, David A., 662, 672 


I 
Imhof, J. P., 216 


J 


Jagannathan, R., 663, 669, 672 

Janacek, G. J., 127 

Jeffreys, H., 533 

Jenkins, Gwilym M., 72, 109-10, 111, 132, 
133 


Johansen, Søren, 590, 601, 635-36, 640, 646,
649, 650, 651 

Johnston, J., 704 

Jorgenson, D. W., 421 

Jorion, Philippe, 662 

Joyeux, Roselyne, 448 

Judge, George G., 142, 148, 227 


K 


Kadane, Joseph B., 363-65 
Kalman, R. E., 372 


790 Author Index 


Kane, Alex, 672 

Keating, John W., 335 
Kelejian, Harry, 226 
Kennard, R. W., 355 
Kiefer, Nicholas M., 689 
Kim, K., 513, 516 
Kinderman, A. J., 216 
King, Robert G., 426, 573 
Kloek, T., 365 
Kocherlakota, N. R., 426 
Koopmans, T. C., 638 
Koreisha, Sergio, 133 
Kremers, J. J. M., 573 
Kroner, Kenneth F., 658, 665, 668, 670 
Kwiatkowski, Denis, 532 


L 


Laffont, J. J., 305, 421 
Lafontaine, Francine, 214 

Laird, N. M., 387, 689 

Lam, Pok-sang, 450, 532, 691 
Lamoureux, Christopher G., 672 
Lastrapes, William D., 672 
Leadbetter, M. R., 157 

Leamer, Edward, 335, 355 

Lee, Joon-Haeng, 690, 691 

Lee, Tsoung-Chao, 142, 148, 227 
LeRoy, Stephen F., 335 

Lewis, Craig M., 672 

Li, W. K., 127 

Lilien, David M., 667, 672 
Litterman, Robert B., 360-62, 402-3 
Ljungqvist, Lars, 305, 447, 653 
Lo, Andrew W., 449, 531, 532 
Loretan, Mico, 608, 613 

Lucas, Robert E., Jr., 306 
Lütkepohl, Helmut, 336, 339


M 


McCurdy, Thomas H., 691 
MacKinlay, A. Craig, 531, 532 
McLeod, A. I., 127 

Maddala, G. S., 250 

Magnus, Jan R., 302, 317, 318, 704 
Makov, U. E., 689 
Malinvaud, E., 411 

Malliaris, A. G., 479 

Mankiw, N. Gregory, 361, 444 
Mark, Nelson, 664 

Marsden, Jerrold E., 196, 704 
Martin, R. D., 127 

Meese, Richard, 305 

Milhøj, Anders, 662, 670
Miller, H. D., 681, 685 
Monahan, J. Christopher, 284, 285, 618 
Monfort, A., 431, 670 

Moore, John B., 47, 373, 403 
Mosconi, Rocco, 650 

Mustafa, C., 672 

Muth, John F., 440 


N 


Nason, James A., 361 

Nelson, Charles R., 109, 253, 426, 444, 504 
Nelson, Daniel B., 662, 666, 667, 668, 672 
Nelson, Harold L., 126 

Nerlove, Mark, 671 

Neudecker, Heinz, 302, 704 

Newbold, Paul, 557 

Newey, Whitney K., 220, 281-82, 284, 414 


Ng, Victor K., 668, 671, 672 
Nicholls, D. F., 218 
Noh, Jason, 672 


O 


Ogaki, Masao, 424, 573, 575, 618, 651 
Ohanian, Lee E., 554 

O’Nan, Michael, 203, 704 

Ouliaris, Sam, 573, 593, 601, 630


P 


Pagan, A. R., 218, 389, 668, 671, 672 

Pantula, S. G., 532, 670 

Park, Joon Y., 214, 483, 532, 547, 549, 573,
575, 601, 618, 651 

Pearce, Douglas K., 305 

Perron, Pierre, 449, 475, 506-16 

Phillips, G. D. A., 386 

Phillips, Peter C. B., 195, 214, 475, 483, 487, 
506-16, 532, 533, 534, 545, 547, 549, 
554, 557, 576-78, 587, 593, 601, 608, 
613-18, 630, 650, 651 

Pierce, David A., 305 

Ploberger, Werner, 698 

Plosser, Charles I., 444, 573 

Port, Sidney C., 749 

Porter-Hudak, Susan, 449 

Powell, M. J. D., 139-42 

Pukkila, Tarmo, 133 


Q 


Quah, Danny, 335 
Quandt, Richard E., 142 


R 


Ramage, J. G., 216 

Rao, C. R., 52, 183, 184, 204 

Rappoport, Peter, 449 

Raymond, Jennie, 664 

Reichlin, Lucrezia, 449 

Reinsel, G. C., 601, 630 

Rich, Robert W., 664 

Rissanen, J., 133 

Robins, Russell P., 667, 672 

Rogers, John H., 677 

Rothenberg, Thomas J., 247, 250, 334, 362, 
388, 411 

Rothschild, Michael, 671 

Royden, H. L., 191 

Rubin, D. B., 387, 689 

Rudebusch, Glenn D., 449 

Runkle, David E., 337, 339, 663, 669, 672 

Ruud, Paul A., 250 


S 

Said, Said E., 530, 532 

Saikkonen, Pentti, 608 

Sargan, J. D., 532 

Sargent, Thomas J., 33, 34, 39, 41, 67, 78, 
109, 335, 423 

Savin, N. E., 216, 488, 516 

Schmidt, Peter, 513, 516, 532 

Scholes, Myron, 672 

Schwert, G. William, 513, 516, 668, 672 

Selover, David D., 573 

Shapiro, Matthew D., 335 

Shiller, Robert J., 360, 573 

Shin, Y., 532 




Shumway, R. H., 387 

Sill, Keith, 426 

Simon, David P., 664 

Sims, Christopher A., 291, 297, 302, 304, 330, 
402-3, 445, 454, 455, 464, 483, 532-34, 
547, 549, 555 

Singleton, Kenneth J., 422, 424 

Smith, A. F. M., 689 

Solo, Victor, 195, 483, 532, 545, 547 

Sowell, Fallaw, 449, 532 

Srba, Frank, 571, 572, 581 

Startz, Richard, 253, 427 

Stinchcombe, Maxwell, 482, 698 

Stock, James H., 305, 376, 444, 445, 447, 454, 
455, 464, 483, 532, 533, 547, 549, 555, 
573, 578, 587, 601, 608, 613, 630, 653 

Stoffer, D. S., 387 

Stone, Charles J., 704 

Strang, Gilbert, 704 

Susmel, Raul, 662 

Swift, A. L., 127 


T 


Tauchen, George, 426, 427, 671 
Taylor, William E., 247 

Theil, Henri, 203, 359, 704 

Thomas, George B., Jr., 157, 166, 704 
Tierney, Luke, 363-65 

Titterington, D. M., 689 

Toda, H. Y., 554, 650 

Trognon, A., 431 


U 


Uhlig, Harald, 532-34 


V 


van Dijk, H. K., 365 
Veall, Michael R., 214 
Vuong, Q. H., 305 


W 


Wall, Kent D., 386, 388

Watson, Mark W., 305, 330, 335, 376, 387, 
389, 444, 447, 454, 455, 464, 483, 547, 
549, 555, 573, 578, 601, 608, 613, 630, 
653 

Wei, C. Z., 483, 532 

Weinbach, Gretchen C., 690, 691 

Weiss, Andrew A., 663 

Wertz, V., 388 

West, Kenneth D., 220, 281-82, 284, 414, 
555, 647, 672 

White, Halbert, 126, 144, 145, 185, 189, 193, 
196, 218, 280, 281, 282, 412, 418, 427, 
428, 429, 431, 482, 664, 698 

White, J. S., 483 

White, Kenneth J., 214 

Whiteman, Charles H., 39, 533 

Wold, Herman, 108-9, 184 

Wooldridge, Jeffrey M., 431, 590, 591, 608, 
663, 671, 672 

Y 


Yeo, Stephen, 571, 572, 581 
Yoo, Byung Sam, 575, 596 


Z 


Zellner, Arnold, 315, 362 
Zuo, X., 672 




Subject Index 


A 


Absolute summability, 52, 64 
autocovariances and, 52 
matrix sequences and, 262, 264 
Absorbing state, 680 
Adaptive expectations, 440 
Adjoint, 727 
Aliasing, 161 
Amplitude, 708 
Andrews-Monahan standard errors, 285 
Annihilation operator, 78 
AR. See Autoregression
ARCH. See Autoregressive conditional 
heteroskedasticity 
Argand diagram, 709 
ARIMA. See Autoregressive integrated 
moving average 
ARMA. See Autoregressive moving average 
Asset prices, 360, 422, 667 
Asymptotic distribution. See also Convergence
autoregression and, 215 
GMM and, 414-15 
limit theorems for serially dependent 
observations, 186-95 
review of, 180-86 
time trends and, 454-60 
of 2SLS estimator, 241-42 
unit root process and, 475-77, 504-6 
vector autoregression and, 298-302 
Autocorrelation:
of a covariance-stationary process, 49 
GLS and, 221-22 
partial, 111-12 
sample, 110-11 
Autocovariance, 45 
matrix, 261 
population spectrum and, 155 
vector autoregression and, 264-66 
Autocovariance-generating function, 61-64 
factoring, 391 
Kalman filter and, 391-94 
of sums of processes, 106 
vector processes and, 266-69 
Autoregression (AR). See also Unit root 
process; Vector autoregression 
first order, 53-56, 486-504
forecasting, 80-82
maximum likelihood estimation for 
Gaussian, 118-27 
parameter estimation, 215-17 




pth order, 58-59 
second order, 56-58
sums of, 107-8 
Autoregressive conditional heteroskedasticity 
(ARCH): 
ARCH-M, 667 
comparison of alternative models, 672 
EGARCH, 668-69 
GARCH, 665-67 
Gaussian disturbances, 660-61 
generalized method of moments, 664 
IGARCH, 667 
maximum likelihood, 660-62 
multivariate models, 670-71 
Nelson’s model, 668-69 
non-Gaussian disturbances, 661-62 
nonlinear specifications, 669-70
nonparametric estimates, 671 
quasi-maximum likelihood, 663-64 
semiparametric estimates, 672 
testing for, 664-65 
Autoregressive integrated moving average 
(ARIMA), 437 
Autoregressive moving average (ARMA): 
autocovariance-generating function, 63 
autoregressive processes, 53-59 
expectations, stationarity, and ergodicity, 
43-47 
forecasting, 83-84 
invertibility, 64-68 
maximum likelihood estimation for Gaussian 
ARMA process, 132-33 
mixed processes, 59-61 
moving average processes, 48-52 
non-Gaussian, 127 
parameter estimation, 132, 387 
population spectrum for, 155 
sums of, 102-8 
white noise and, 47-48


B 


Bandwidth, 165, 671 
Bartlett kernel, 167, 276-77 
Basis, cointegrating vectors and, 574 
Bayesian analysis: 
diffuse/improper prior, 353 
estimating mean of Gaussian distribution,
352-53
estimating regression model with lagged
dependent variables, 358
estimating regression model with unknown
variance, 355-58
introduction to, 351-60
mixture distributions, 689
Monte Carlo, 365-66
numerical methods, 362-66
posterior density, 352
prior density, 351-52
regime-switching models, 689
unit roots, 532-34
vector autoregression and, 360-62
Bayes's law, 352 
Beveridge-Nelson decomposition, 504 
Bias, 741
simultaneous equations, 233-38
Block exogeneity, 309, 311-13 
Block triangular factorization, 98-100 
Bootstrapping, 337 
Box-Cox transformation, 126 
Box-Jenkins methods, 109-10 
Brownian motion, 477-79
differential, 547
standard, 478, 544
Bubble, 38 
Business cycle frequency, 168-69 


C


Calculus, 711-21 
Canonical cointegration, 618 
Canonical correlation: 
population, 630-33 
sample, 633-35 
Cauchy convergence, 69-70 
Cauchy-Schwarz inequality, 49, 745 
Central limit theorem, 185-86 
functional, 479-86 
Martingale difference sequence, 193-95 
stationary stochastic process, 195 
Chain rule, 712 
Chebyshev’s inequality, 182-83 
Chi-square distribution, 746, 753
Cholesky factorization, 91-92, 147 
Cochrane-Orcutt estimation, 224, 324 
Coefficient of relative risk aversion, 423 
Coherence, population, 275 
Cointegrating vector, 574, 648-50 
Cointegration, 571 
basis, 574 
canonical, 618 
cointegrating vector, 574, 648-50
common trends representation (Stock-Watson), 578
description of, 571-82 
error-correction representation, 580-81 
Granger representation theorem, 581-82 
hypothesis testing, 601-18 
moving average representation, 574-75 
Phillips-Ouliaris-Hansen tests, 598-99 
testing for, 582-601, 645
triangular representation (Phillips), 576-78
vector autoregression and, 579-80
Cointegration, full-information maximum 
likelihood and: 
hypothesis testing, 645-50 
Johansen’s algorithm, 635-38
motivation for auxiliary regressions, 638-39
motivation for canonical correlations,
639-42
motivation for parameter estimates, 
642-43 
parameter estimates, 637-38 
population canonical correlations, 630-33 
sample canonical correlations, 633-35 
without deterministic time trends, 643-45 
Complex: 
conjugate, 710
numbers, 708-11 
unit circle, 709 
Concentrated likelihood, 638 
Conditional distributions, 741-42
Conditional expectation, 742 
for Gaussian variables, 102 
Conditional likelihood, vector autoregression 
and, 291-93 
Conjugate pair, 710 
Conjugate transposes, 734-35 
Consistent, 181, 749 
Consumption spending, 361, 572, 600-1,
610-12, 650 
Continuity, 711 
Continuous function, 711, 735 
Continuous mapping theorem, 482-83 
Continuous time process, 478 
Convergence: 
Cauchy criterion, 69-70 
in distribution, 183-85 
Kalman filter and, 389-90 
limits of deterministic sequences, 180 
in mean square, 182-83, 749
for numerical optimization, 134, 137 
in probability, 181-82, 749 
of random functions, 481 
ordinary, 180 
weak, 183 
Correlation: 
canonical, 630-35 
population, 743 
Cosine, 704, 706-7 
Cospectrum, 271-72 
Covariance: 
population, 742 
triangular factorization and, 114-15
Covariance restrictions, identification and, 
246-47 
Covariance-stationary, 45-46, 258
law of large numbers and, 186-89 
Cramér-Wold theorem, 184 
Cross spectrum, 270 
Cross validation, 671 


D 


Davidon-Fletcher-Powell, 139-42 
De Moivre’s theorem, 153, 716-17 
Density/ies, 739. See also Distribution
unconditional, 44
Derivative(s):
of matrix expressions, 294, 737
partial, 735
second-order, 712, 736
of simple functions, 711-12
of vector-valued functions, 737
Determinant, 724-27
of block diagonal matrix, 101
Deterministic time trends. See Time trends
Dickey-Fuller test, 490, 502, 528-29, 762-64 
augmented, 516, 528 
F test, 494, 524 
Difference equation: 
dynamic multipliers, 2-7 
first-order, 1-7, 27-29 
pth-order, 7-20, 33-36 
repeated eigenvalues, 18-19 
second-order, 17, 29-33 
simulating, 10 
solving by recursive substitution, 1-2 
Difference stationary, 444 
Distributions, 739. See also Asymptotic 
distribution 
chi-square, 746, 753 
conditional, 741-42
convergence in, 183-85 
F, 205-7, 357, 746, 756-60 
gamma, 355 
Gaussian, 745-46, 748-49, 751-52 
generalized error, 668 
joint, 741 
joint density-distribution, 686
marginal, 741 
mixture, 685-89 
Normal, 745-46, 748-49, 751-52 
posterior, 352 
prior, 351-52 
probability, 739 
t, 205, 356-57, 409-10, 746, 755 
Duplication matrix, 301 
Dynamic multipliers, 2-7, 442-44 
calculating by simulation, 2-3 


E 


Efficient estimate, 741 
Efficient markets hypothesis, 306 
Eigenvalues, 729-32 
Eigenvectors, 729-30 
Elasticity, logarithms and, 717-18 
EM algorithm, 688-89, 696 
Endogenous variables, 225-26
Ergodicity, 46-47
Ergodic Markov chain, 681-82
Error-correction representation, 580-81 
Euler equations, 422 
Euler relations, 716-17 
Exchange rates, 572, 582-86, 598, 647-48 
Exclusion restrictions, 244 
Expectation, 740
adaptive, 440
conditional, 72-73, 742
of infinite sum, 52
stochastic processes and, 43-45
Exponential functions, 714-15 
Exponential smoothing, 440 


F


F distribution, 205-7, 357, 746, 756-60 
Filters, 63-64, 169-72, 277-79. See also
Kalman filter
multivariate, 264
FIML. See Full-information maximum
likelihood
First-difference operator, 436 
First-order autoregressive process, 53-56 
asymptotic distribution and, 215, 486-504 
First-order difference equations, 1-7 
lag operators and, 27-29 
First-order moving average, 48-49
Fisher effect, 651 
Forecasts/forecasting: 
ARMA processes, 83-84
AR process, 80-82 
Box-Jenkins methods, 109-10 
conditional expectation and, 72-73 
finite number of observations and, 85-87 
for Gaussian processes, 100-102 
infinite number of observations and, 77-84 
Kalman filter and, 381-85 
linear projection and, 74-76, 92-100
macroeconomic, 109 
MA process, 82-83, 95-98 
Markov chain and, 680 
for noninvertible MA, 97 
nonlinear, 73, 109 
unit root process and, 439-41 
vectors, 77 
Fractional integration, 448-49 
Frequency, 708 
Frequency domain. See Spectral analysis 
Full-information maximum likelihood (FIML), 
247-50, 331-32. See also Cointegration, 
full-information maximum likelihood 
and 
Functional central limit theorem, 479-86 
Fundamental innovation, 67, 97, 260 


G 


Gain, 275 
Kalman, 380 
Gamma distribution, 355 
Gamma function, 355 
Gaussian: 
distribution, 745-46, 748-49, 751-52 
forecasting, 100-102 
kernel, 671 
maximum likelihood estimation for Gaussian 
ARMA process, 132-33 
maximum likelihood estimation for Gaussian 
AR process, 118-27 
maximum likelihood estimation for Gaussian 
MA process, 127-31 
process, 46 
white noise, 25, 43, 48 
Gauss-Markov theorem, 203, 222 
Generalized error distribution, 668 
Generalized least squares (GLS): 
autocorrelated disturbances, 221-22 
covariance matrix and, 220-21 
estimator, 221 
heteroskedastic disturbances, 221 
maximum likelihood estimation and, 222 
Generalized method of moments (GMM): 
ARCH models, 664 
asymptotic distribution of, 414-15 
estimation by, 409-15 
estimation of dynamic rational expectation 
models, 422-24 
examples of, 415-24 
extensions, 424-27 
identification (econometric) and, 426 
information matrix equality, 429 
instrumental variable estimation, 418-20 
instruments of choice for, 426-27 
maximum likelihood estimation and, 427-31 
nonlinear systems of simultaneous equations, 
421-22 
nonstationary data, 424 
optimal weighting matrix, 412-14
ordinary least squares and, 416-18
orthogonality conditions, 411 
overidentifying restrictions, 415 
specification testing, 415, 424-26 
testing for structural stability, 424-26 
two-stage least squares and, 420-21
Geometric series, 713, 732
Global identification, 388
Global maximum, 134, 137, 226
GLS. See Generalized least squares
GMM. See Generalized method of moments
GNP. See Gross national product
Gradient, 735-36
Granger causality test, 302-9
Granger representation theorem, 582
Grid search, 133-34
Gross national product, 112, 307, 444, 450,
697-98. See also Business cycle
frequency; Industrial production;
Recessions


H 


Hessian matrix, 139, 736
Heteroskedasticity, 217-20, 227. See also
Autoregressive conditional
heteroskedasticity (ARCH); Newey-West
estimator
consistent standard error, 219, 282-83
GLS and, 221
Hölder’s inequality, 197

Hypothesis tests: 
cointegration and, 601-18, 645-50 
efficient score, 430 
Lagrange multiplier, 145, 430 
likelihood ratio, 144-45, 296-98 
linear restrictions, 205 
nonlinear restrictions, 214, 429-30 
time trends and, 461-63 
Wald, 205, 214, 429-30 


I 


I(d). See Integrated of order d 
Idempotent, 201 
Identification, 110, 243-46 
covariance restrictions, 246-47 
exclusion restrictions, 244 
global, 388 
GMM and, 426 
just identified, 250 
Kalman filter and, 387-88
local, 334, 388 
order condition, 244, 334 
overidentified, 250 
rank condition, 244, 334 
structural VAR, 332 
Identity matrix, 722 
i.i.d., 746
Imaginary number, 708 
Impulse-response function: 
calculating by simulation, 10 
orthogonalized, 322 
standard errors, 336-40 
univariate system, 5 
vector autoregression and, 318-23 
Independence: 
linear, 728, 729-30 
random variables, 742 
Industrial production, 167 
Inequalities: 
Cauchy-Schwarz, 49, 745
Chebyshev’s, 182-83
Hölder, 197
triangle, 70 
Inequality constraints, 146-48 
Infinite-order moving average, 51-52 
Information matrix, 143-44 
equality, 429 
Innovation, fundamental, 67
Instrumental variable (IV) estimation, 242-43,
418-20
Instruments, 238, 253, 426-27
Integrals:
definite, 719-21
indefinite, 718-19
multiple, 738-39
Integrated of order d, 437, 448
Integrated process, 437. See also Unit root
process
fractional, 448-49
Integration, 718
constant of, 719
Interest rates, 376, 501, 511-12, 528, 651
Invertibility, 64-68
IV. See Instrumental variable (IV) estimation


J 

Jacobian matrix, 737
Johansen’s algorithm, 635-38
Joint density, 741
Joint density-distribution, 686
Jordan decomposition, 730-31


K 


Kalman filter:
autocovariance-generating function and,
391-94
background of, 372 
derivation of, 377-81 
estimating ARMA processes, 387 
forecasting and, 381-85 
gain matrix, 380 
identification, 387-88 
MA(1) process and, 381-84 
maximum likelihood estimation and, 
385-89 
parameter uncertainty, 398 
quasi-maximum likelihood and, 389 
smoothing and, 394-97 
state-space representation of dynamic 
system, 372-77
statistical inference with, 397-98
steady-state, 389-94 
time-varying parameters, 399-403 
Wold representation and, 391-94 
Kernel estimates, 165-67. See also 
Nonparametric estimation 
Bartlett, 167, 276-77 
Gaussian, 671 
Parzen, 283 
quadratic spectral, 284 
Khinchine’s theorem, 183 
Kronecker product, 265, 732-33 
Kurtosis, 746 


L


Lag operator:
first-order difference equations and, 27-29
initial conditions and unbounded sequences,
36-42
polynomial, 27
pth-order difference equations and, 33-36
purpose of, 26
second-order difference equations and,
29-33
Lagrange multiplier, 135, 145, 430 
Law of iterated expectations, 742 
Law of iterated projections, 81, 100 
Law of large numbers, 183, 749 
covariance-stationary processes, 186-89
mixingales, 190-92
Leverage effect, 668 
Likelihood function, 746-47. See also 
Maximum likelihood estimation (MLE) 
concentrating, 638 
vector autoregression and, 291-94, 310-11 
Likelihood ratio test, 144-45, 296-98, 648-50 
Limit. See Convergence 
Linear dependence, 728-29 
Geweke’s measure of, 313-14 
Linearly deterministic, 109 
Linearly indeterministic, 109 
Linear projection: 
forecasts and, 74-76, 92-100 
multivariate, 75 
ordinary least squares regression and, 
75-76, 113-14
properties of, 74-75 
updating, 94 
Linear regression. See also Generalized least 
squares (GLS); Generalized method of 
moments (GMM); Ordinary least 
squares (OLS) 
algebra of, 200-202 
review of OLS and i.i.d., 200-207 
Local identification, 334, 388 
Local maximum, 134, 137, 226 
Logarithms, 717-18 
Long-run effect, 6-7 
Loss function, 72 


M 


MA. See Moving average 
Markov chain, 678 
absorbing state, 680 
ergodic, 681-82 
forecasting, 680 
periodic, 685 
reducible, 680 
transition matrix, 679 
two-state, 683-84 
vector autoregressive representation, 679 
Martingale difference sequence, 189-90, 
193-95 
Matrix/matrices: 
adjoint, 727 
conjugate transposes, 734-35
determinant, 724-27 
diagonal, 721 
duplication, 301 
gain, 380 
geometric series, 732 
Hessian, 139, 736 
idempotent, 201 
identity, 722 
information, 143-44, 429 
inverse, 727-28 
Jacobian, 737 
Jordan decomposition, 730-31 
lower triangular, 725
nonsingular, 728 
optimal weighting, 412-14 
partitioned, 724 
positive definite, 733-34 
positive semidefinite, 733 
power of, 722 
singular, 728 
square, 721 
symmetric, 723 
trace of, 723-24 
transition, 679 
transposition, 723 
triangular, 729 
triangular factorization, 87 
upper triangular, 727 
Maximum likelihood estimation (MLE), 117, 
747. See also Quasi-maximum likelihood 
asymptotic properties of, 142-45, 429-30 
concentrated, 638 
conditional, 122, 125-26 
EM algorithm and, 688-89 
full-information maximum likelihood, 
247-50 
Gaussian ARMA process and, 132-33 
Gaussian AR process and, 118-27 
Gaussian MA process and, 127-31 
general coefficient constraints and, 315-18 
global maximum, 134, 137 
GLS and, 222 
GMM and, 427-31 
Kalman filter and, 385-89 
local, 134, 137 
prediction error decomposition, 122, 129 
regularity conditions, 427, 698 
standard errors for, 143-44, 429-30 
statistical inference with, 142-45 
vector autoregression and, 291-302, 309-18 
Wald test for, 429-30 
Mean: 
ergodic for the, 47 
population, 739 
sample, 186-95, 279-85, 740-41 
unconditional, 44 
Mean square, convergence in, 182-83, 749 
Mean squared error (MSE), 73 
of linear projection, 74, 75, 77 
Mean-value theorem, 196 
Mixingales, 190-92 
Mixture distribution, 685-89 
MLE. See Maximum likelihood estimation 
(MLE) 
Modulus, 709 
Moments. See also Generalized method of 
moments (GMM) 
population, 739-40, 744-45 
posterior, 363-65 
sample, 740-41 
second, 45, 92-95, 192-93 
Money demand, 1, 324 
Monte Carlo method, 216, 337, 365-66, 398 
Moving average (MA): 
cointegration and, 574-75 
first order, 48-49 
forecasting, 82-83, 95-98 
infinite order, 51-52 
maximum likelihood estimation for 
Gaussian, 127-31, 387
parameter estimation, 132, 387 
population spectrum for, 154-55, 276 
qth order, 50-51
sums of, 102-7
vector, 262-64 
MSE. See Mean squared error (MSE) 


N 


Newey-West estimator, 220, 281-82 
Newton-Raphson, 138-39 
Nonparametric estimation. See also Kernel 
bandwidth, 165, 671 
conditional variance and, 671 
cross validation, 671 
population spectrum, 165-67 
Nonsingular, 728 
Nonstochastic, 739 
Normal distribution, 745-46, 748-49, 751-52 
Normalization, cointegration and, 589 
Numerical optimization: 
convergence criterion, 134, 137 
Davidon-Fletcher-Powell, 139-42 
EM algorithm, 688-89, 696 
grid search, 133-34 
inequality constraints, 146-48 
Newton-Raphson, 138-39 
numerical maximization, 133, 146 
numerical minimization, 142 
steepest ascent, 134-37 


O 


Observation equation, 373 
Oil prices, effects of, 307-8 
OLS. See Ordinary least squares 
O_p. See Order in probability
Operators: 
annihilation, 78 
first-difference, 436 
time series, 25-26 
Option prices, 672 
Order in probability, 460 
Ordinary least squares (OLS). See also 
Generalized least squares (GLS); 
Hypothesis tests; Regression 
algebra of, 75-76, 200-202 
autocorrelated disturbances, 217, 282-83 
chi-square test, 213 
distribution theory, 209, 432-33 
estimated coefficient vector, 202-3 
F test, 205-7 
GMM and, 416-18 
heteroskedasticity, 217, 282-83 
linear projection and, 75-76, 113-14 
non-Gaussian disturbances, 209 
time trends and, 454-60 
t test, 204, 205 
Orthogonal, 743 
Orthogonality conditions, 411 
Orthogonalized impulse-response function, 322 
Outer-product estimate, 143 


P 


Partial autocorrelation: 
population, 111-12 
sample, 111-12
Parzen kernel, 283
Period, 708
Periodic, 707
Markov chain, 685
Periodogram:
multivariate, 272-75
univariate, 158-63
Permanent income, 440
Phase, 275, 708
Phillips-Ouliaris-Hansen tests, 599 
Phillips-Perron tests, 506-14, 762-63 
Phillips triangular representation, 576-78 
Plim, 181, 749 
Polar coordinates, 704-5, 710 
Polynomial in lag operator, 27, 258 
Population:
canonical correlations, 630-33
coherence, 275
correlation, 743
covariance, 742
moments, 739-40, 744-45
spectrum, 61-62, 152-57, 163-67, 269,
276-77

Posterior density, 352 
Power series, 714 
Precision, 355 
Predetermined, 238 
Prediction error decomposition, 122, 129, 310 
Present value, 4, 19-20 
Principal diagonal, 721 
Prior distribution, 351 
Probability limit, 181, 749 
pth-order autoregressive process, 58-59 
pth-order difference equations, 7-20, 33-36 
Purchasing power parity. See Exchange rates 


Q 


qth-order moving average, 50-51
Quadratic equations, 710-11
Quadratic spectral kernel, 284
Quadrature spectrum, 271
Quasi-maximum likelihood estimate, 126, 145,
430-31
GLS, 222
GMM and, 430-31
Kalman filter and, 389
standard errors, 145


R 


Radians, 704 
Random variable, 739 
Random walk, 436. See also Unit root process 
OLS estimation, 486-504 
Rational expectations, 422 
efficient markets hypothesis, 306 
Real interest rate, 376 
Real number, 708 
Recessions, 167-68, 307-8, 450, 697-98
Recursive substitution, 1-2 
Reduced form, 245-46, 250-52 
VAR, 327, 329 
Reducible Markov chain, 680 
Regime-switching models:
Bayesian estimation, 689 
derivation of equations, 692-93 
description of, 690-91 
EM algorithm, 696 
maximum likelihood, 692, 695-96 
singularity, 689 
smoothed inference and forecasts, 694-95 
Regression. See also Generalized least squares 
(GLS); Generalized method of 
moments (GMM); Ordinary least 
squares (OLS) 
classical assumptions, 202 
time-varying parameters, 400 
Regularity conditions, 427, 698
Residual sum of squares (RSS), 200 
Ridge regression, 355
RSS. See Residual sum of squares
R², 202


S 


Sample autocorrelations, 110-11 
Sample canonical correlations, 633-35 
Sample likelihood function, 747 
Sample mean: 
definition of, 741 
variance of, 188, 279-81 
Sample moments, 740-41 
Sample periodogram, 158-63, 272-75
Scalar, 721 
Score, 427-28 
Seasonality, 167-69 
Second moments, 45, 92-95 
consistent estimation of, 192-93
Second-order autoregressive process, 56-58 
Second-order difference equations, 17, 29-33 
Seemingly unrelated regressions, 315 
Serial correlation, 225-27
Sims-Stock-Watson: 
scaling matrix, 457 
transformation, 464, 518 
Simultaneous equations. See also Two-stage 
least squares 
bias, 233-38, 252-53 
estimation based on the reduced form, 
250-52
full-information maximum likelihood 
estimation, 247-50 
identification, 243-47 
instrumental variables and two-stage least 
squares, 238-43 
nonlinear systems of, 421-22 
overview of, 252-53 
Sine, 704, 706-7 
Singular, 728 
Singularity, 689 
Sinusoidal, 706 
Skew, 746 
Small-sample distribution, 216-17, 516 
Smoothing, Kalman filter and, 394-97 
Spectral analysis: 
population spectrum, 152-57, 163-67, 269 
sample periodogram, 158-63, 272-75 
uses of, 167-72
Spectral representation theorem, 157 
Spectrum. See also Kernel estimates; 
Periodogram 
coherence, 275 
cospectrum, 271-72 
cross, 270 
estimates of, 163-67, 276-77, 283-85 
frequency zero and, 189, 283 
gain, 275 
low-frequency, 169 
phase, 275 
population, 61-62, 152-57, 163-67, 269,
276-77 
quadrature, 271 
sample, 158-63, 272-75 
sums of processes and, 172 
transfer function, 278 
vector processes and, 268-78 
Spurious regression, 557-62 
Square summable, 52 
Standard deviation, population, 740
State equation, 372 
State-space model. See Kalman filter 
State vector, 372 
Stationary/stationarity: 
covariance, 45-46 
difference, 444 
strictly, 46 
trend-stationary, 435 
vector, 258-59 
weakly, 45-46 
Steepest ascent, 134-37 
Stochastic processes: 
central limit theorem for stationary, 195 
composite, 172 
expectations and, 43-45 
Stochastic variable, 739
Stock prices, 37-38, 306-7, 422-24, 668-69, 
672 
Structural econometric models, vector 
autoregression and, 324-36
Student’s t distribution. See t distribution
Summable: 
absolute, 52, 64 
square, 52 
Sums of ARMA processes, 102-8
autocovariance generating function of, 106 
AR, 107-8 
MA, 102-7 
spectrum of, 172 
Superconsistent, 460 
Sup operator, 481 


T 


Taxes, 361 
Taylor series, 713-14, 737-38 
Taylor theorem, 713, 737-38 
t distribution, 205, 213, 356-57, 409-10, 746, 
755 
Theorems (named after authors): 
Cramér-Wold, 184 
De Moivre, 153, 716-17 
Gauss-Markov, 203, 222 
Granger representation, 582 
Khinchine’s, 183 
Taylor, 713, 737-38 
Three-stage least squares, 250 
Time domain, 152 
Time series operators, 25-26 
Time series process, 43 
Time trends, 25, 435. See also Trend-stationary 
approaches to, 447-50 
asymptotic distribution of, 454-60 
asymptotic inference for autoregressive 
process around, 463-72 
breaks in, 449-50 
hypothesis testing for, 461-63 
linear, 438 
OLS estimation, 463 
Time-varying parameters, Kalman filter and, 
398-403 
Trace, 723 
Transition matrix, 679 
Transposition, 723 
Trends representation (Stock-Watson), 
common, 578 
Trend-stationary, 435 
comparison of unit root process and, 438-44
forecasts for, 439 
Triangular factorization: 
block, 98-100
covariance matrix and, 114-15
description of, 87-91 
maximum likelihood estimation and, 128-29 
of a second-moment matrix and linear 
projection, 92-95 
Triangular representation, 576-78 
Trigonometry, 157, 166, 704-8
t statistic, 204 
2SLS. See Two-stage least squares 
Two-stage least squares (2SLS): 
asymptotic distribution of, 241-42 
coefficient estimator, 238 
consistency of, 240-41 
general description of, 238-39 
GMM and, 420-21 
instrumental variable estimation, 242-43


U 


Unconditional density, 44
Unconditional mean, 44
Uncorrelated, 92, 743
Unidentified, 244
Uniformly integrable, 191
Unimodal, 134
Unit circle, 32, 709
Unit root process, 435-36. See also
Cointegration; Dickey-Fuller test
asymptotic distribution, 475-77, 504-6
Bayesian analysis, 532-34
Beveridge-Nelson decomposition, 504,
545-46
comparison of trend-stationary and, 438-44
difference versus not to difference, 651-53 
dynamic multipliers, 442-44 
forecasts for, 439-41 
functional central limit theorem and, 483-86 
Johansen’s test, 646 
meaning of tests for, 444-47, 515-16 
multivariate asymptotic theory, 544, 547 
observational equivalence, 444-47, 515-16
OLS estimation of autoregression, 527 
Phillips-Perron tests, 506-14 
small-sample distribution, 516 
spurious regression, 557-62 
variance ratio test, 531-32 
vector autoregression, 549-57 


V 


VAR. See Vector autoregression 
Variance, 44-45, 740
decomposition, 323-24
population, 740 
of sample mean, 188, 279-81 
Variance ratio test, 531-32 
Vech operator, 300-301 
Vec operator, 265 
Vector autoregression. See also Cointegration; 
Impulse-response function 
autocovariances and convergence results for, 
264-66 
autocovariance generating function and, 267 
Bayesian analysis and, 360-62 
cointegration and, 579-80 
impulse-response function and, 318-23 
introduction to, 257-61 
likelihood ratio test, 296-98 
Markov chain and, 679 
maximum likelihood estimation and, 
291-302, 309-18 
restricted, 309-18 
spectrum for, 276 
standard errors, 298, 301, 336-340 
stationarity, 259 
structural econometric models and, 324-36
time-varying parameters, 401-3 
unit roots, 549-57 
univariate representation, 349 
Vector martingale difference sequence, 189 
Vector processes, asymptotic results for 
nonstationary, 544-48
Vectors, forecasting, 77 


W 


Wald form, 213, 299
Wald test, 205, 214
for maximum likelihood estimation, 429-30
White noise:
Gaussian, 25, 43, 48
independent, 48
process, 47-48
Wiener process, 478
Wiener-Kolmogorov prediction formula, 80
Wold’s decomposition, 108-9
Kalman filter and, 391-94


Y 


Yule-Walker equations, 59 




